Preface


In science, technology, and mathematics, a network is a system of
interconnected objects. Complex network analysis (CNA) is a discipline of exploring
quantitative relationships in networks with non-trivial, irregular structure.
The actual nature of the networks (social, semantic, transportation,
communication, economic, and the like) doesn’t matter, as long as their organization
doesn’t reveal any specific patterns. This book was inspired by a decade of
CNA practice and research.

Being a professor of mathematics and computer science at Suffolk University
in Boston, I have experimented with complex networks of various sizes,
purposes, and origins. I developed my first CNA software in an ad hoc manner
in the C language, the language venerable yet ill-suited for CNA projects.
The price of explicit memory management, cumbersome file input/output,
and lack of advanced built-in data structures (such as maps and lists) was
simply too high to justify a further commitment to C. At the moment I realized
that there were affordable alternatives to C that did not require low-level
programming (such as Pajek [NMB11] and Mathematica¹), off I went.

Both systems that I mentioned had significant restrictions. Mathematica was
proprietary (and, frankly, quite costly). My inner open source advocate
demanded that I cease and desist using it, especially given that earlier versions
of Mathematica didn’t provide dedicated CNA support and failed to handle
big networks. Pajek was proprietary, too, and not programmable. It took a
joint effort of my inner open source advocate and inner programmer to push
it to the periphery. (I still occasionally use Pajek, and I believe it’s a great
system for solving non-recurring problems.)

I felt delighted when, in search of open source, free, scalable, reliable, and
programmable CNA software, I ran into NetworkX, a Python library still in its
infancy. For the next several years, it became my tool of choice when it came
to CNA simulation, analysis, or visualization.


1. www.wolfram.com/mathematica 
About the Reader

This book is intended for graduate and undergraduate students, complex
network analysis (CNA) or social network analysis (SNA) instructors, and CNA/SNA
researchers and practitioners. The book assumes that you have some background
in computer programming, namely, in Python programming. It expects
from you no more than common sense knowledge of complex networks. The
intention is to build up your CNA programming skills and at the same time
educate you about the elements of CNA itself. If you’re an experienced Python
programmer, you can devote more attention to the CNA techniques. On the
contrary, if you’re a network analyst with less than an excellent background
in Python programming, your plan should be to move slowly through the dark
woods of data frames and list comprehensions and use your CNA intuition
to grasp programming concepts.

About the Book 

This book covers construction, exploration, analysis, and visualization of 
complex networks using NetworkX (a Python library), as well as several other
Python modules, and Gephi, an interactive environment for network analysts.
The book is not an introduction to Python. I assume that you already know 
the language, at least at the level of a freshman programming course. 

The book consists of five parts, each covering specific aspects of complex
networks. Each part comes with one or more detailed case studies.

Part I presents an overview of the main Python CNA modules: NetworkX, iGraph,
graph-tool, and networkit. It then goes over the construction of very simple
networks both programmatically (using NetworkX) and interactively (in Gephi), and
it concludes by presenting a network of Wikipedia pages related to complex
networks.

In Part II, you’ll look into networks based on explicit relationships (such as
social networks and communication networks). This part addresses advanced
network construction and measurement techniques. The capstone case study,
a network of “Panama papers,” illustrates possible money-laundering
patterns in Central Asia.

Networks based on spatial and temporal co-occurrences—such as semantic 
and product networks—are the subject of Part III. The third part also explores 
macroscopic and mesoscopic complex network structure. It paves the way to 
network-based cultural domain analysis and a marketing study of Sephora 
cosmetic products. 
If you cannot find any direct or indirect relationships between the items, but
still would like to build a network of them, the contents of Part IV come to
the rescue. You will learn how to find out if items are similar, and you will
convert quantitative similarities into network edges. A network of
psychological trauma types is one of the outcomes of the fourth part.

The book concludes with Part V: directed networks with plenty of examples,
including a network of qualitative adjectives that you could use in computer
games or fiction.

When you finish your journey, you’ll be able to identify, sketch (by hand,
in Gephi, and programmatically), transform, analyze, and visualize several
types of complex networks. You’ll be able to interpret network measures and
structure. The book doesn’t aim to be a comprehensive CNA reference. Many
discipline-specific aspects, such as triadic census, exponential random graph
models (ERGMs), and network flows, as well as the whole story of network
dynamics (evolution and contagion), have been intentionally left uncharted.
The bibliography on page 215 will take you to more destinations of your choice,
whether they be economic networks, web scraping, or classical social network
analysis.

About the Software 

This book uses Python 3.x and networkx 1.11. All Python examples in this book
are known to work for the modules mentioned in the following table. All of these
modules are included in the Anaconda distribution, with the exception of
community,² toposort,³ wikipedia,⁴ and generalized,⁵ which must be installed
separately. Anaconda is provided by Continuum Analytics and is available for free.⁶


Package      Used version    Package      Used version
python       3.4.5           networkx     1.11
matplotlib   1.5.1           community    0.9
nltk         3.2.2           numpy        1.11.3
pandas       0.19.2          pygraphviz   1.3.1
wikipedia    1.4             scipy        0.18.1
toposort     1.5



2. pypi.python.org/pypi/python-louvain 

3. pypi.python.org/pypi/toposort 

4. pypi.python.org/pypi/wikipedia 

5. pragprog.com/titles/dzcnapy/source_code 

6. www.continuum.io 
The easiest way to install the missing modules is by running pip on your
operating system shell command line.

pip install toposort
pip install wikipedia 
pip install python-louvain 
pip install pygraphviz 

If you want to use module pygraphviz to lay out networks, you first need to install
Graphviz (including the developers add-on graphviz-dev).⁷

In September 2017, a new version of NetworkX was released, NetworkX 2.0. 
Appendix 2, NetworkX 2.0, on page 213 provides useful information about 
converting your CNA scripts to the new version.

About the Notation

The following covers the specific notation used in this book. 

Program Output 

The book uses a left-pointed gray arrow in the left margin of a page to indicate
program outputs. In the following scenario, print(1 + 2) is a Python statement,
and 3 is the visual output of the statement.

print(1 + 2)

3 


"This Chapter Uses X"

“This chapter/section uses X” informs you that the material
in the chapter or section goes beyond the core Python and
NetworkX. If you’re unfamiliar with X, you’ll probably understand the content
but may experience difficulties with comprehending the included code
snippets. You’re advised to refresh your knowledge of the listed modules.


Directed Edges 

NetworkX uses module Matplotlib for network visualization. You would expect
directed edges to have an arrow at the head end, and Matplotlib fully supports
arrows. However, NetworkX draws thick rectangular stubs instead. This is just
something you’ll have to get used to. If you need a publication-quality network
image with arrows, consider using Gephi.


7. www.graphviz.org/ 
Online Resources

This book has its own web page⁸ where you can find all the code for this book.
There you’ll also find the community forum, where you can ask questions,
post comments, and submit errata.

Two other great community-operated resources for questions and answers
are the Stack Overflow forum⁹ and NetworkX Google discussion group.¹⁰

Now, let's get started! 

Dmitry Zinoviev 

dzinoviev@gmail.com 
January 2018


8. pragprog.com/book/dzcnapy 

9. stackoverflow.com/questions/tagged/networkx 

10. groups.google.com/forum/#!forum/networkx-discuss
When all you have is a hammer, everything looks like a nail.
Proverb 


CHAPTER 1 

The Art of Seeing Networks

Complex network analysis (CNA) is a rapidly expanding discipline that studies
how to recognize, describe, analyze, and visualize complex networks. The
Python library NetworkX provides a collection of functions for constructing,
measuring, and drawing complex networks. We’ll see in this book how CNA
and NetworkX work together to automate mundane and tedious CNA tasks and
make it possible to study complex networks of varying sizes and at varying
levels of detail.

At this point, you may be wondering what a network is, why some networks
are complex, why it is important to recognize, describe, analyze, and visualize
them, and why the discipline is expanding right now instead of having
expanded, say, a hundred years ago. If you’re not, then you’re probably a
seasoned complex network researcher, and you may want to skip the rest of
this chapter and proceed to the CNA and Python technicalities (Chapter 2,
Surveying the Tools of the Craft, on page 11). Otherwise, stay with us!

Complex networks, like mathematics, physics, and biology, have been in
existence for at least as long as we humans have. Biological complex networks, in
fact, predate humankind. However, intensive studies of complex networks did
not start until the late 1800s to early 1900s, mostly because of the lack of
proper mathematical apparatus (graph theory, in the first place) and adequate
computational tools. The reason for the explosion of CNA research and
applications in the late 1900s-early 2000s is two-fold. On the “supply” side, it is the
availability of cheap and powerful computers and the abundance of researchers
with advanced training in mathematics, physics, and social sciences. On the
“demand” side, it is the ever increasing complexity of social, behavioral,
biological, financial, and technological (to name a few) aspects of humanity.

In this chapter, you will see different types and kinds of networks (including
complex networks) and learn why networks are important and why it is worth
seeing them around. You will be able to spot complex networks, capture them
(so far, without any software), and get some sense about their useful
properties (again, with no software necessary). When you see the limitations of
the paper-and-pencil method, you will be ready to dive into the computerized
proper complex network analysis.

Know Thy Networks 

In general, a network is yet another (relational) form of organization and
representation of discrete data. (The other one being tabular, with the data
organized in rows and columns.) Two important network concepts are entities
and the relationships between them. Depending on a researcher’s background,
entities are known as nodes (the term we’ll use in this book), actors, or
vertices. Relationships are known as edges (preferred in this book), links, arcs,
or connections. We will casually refer to networks as “graphs” (in the
graph-theoretical meaning of the word), even though graphs are not the only way
to describe networks.



Graphs and Graphs 

When it comes to mathematics, the word “graph” has at least two
different meanings. In algebra and calculus, a graph of a function
is a continuous line chart or surface plot. In graph theory, a graph
is a set of discrete objects (vertices, depicted diagrammatically as
dots), possibly joined by edges (depicted as lines or arcs). We will
always use the latter definition unless explicitly stated.


Network nodes and edges are high-level abstractions. For many types of
network analysis, their true nature is not essential. (When it is, we decorate
nodes and edges by adding properties, also known as attributes.) What matters
is the discreteness of the entities and the binarity of the relationships. A
discrete entity must be separable from all other entities; otherwise, it is not
clear how to represent it as a node. A relationship typically involves two
discrete entities: in other words, any two entities either are in a relationship or
not. (An entity can be in a relationship with itself. Such a relationship is called
reflexive.) It is not directly possible to use networks to model relationships
that involve more than two entities, but if such modeling is really necessary,
then you can use hypergraphs, which are beyond the scope of this book.


Once all of the above conditions are met, you can graphically represent and
visualize a node as a point or circle and an edge as a line or arc segment. You
can further express node and edge attributes by adding line thickness, color,
different shapes and sizes, and the like.
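
The same ideas can be sketched in plain Python, with no network library at all. In the minimal sketch below, the node names, the attribute keys, and the helper related() are hypothetical and chosen only for illustration: nodes are discrete dictionary keys, every edge joins exactly two nodes (possibly the same node twice), and attributes decorate both.

```python
# Nodes are discrete, separable entities; attributes decorate them.
# All names and values here are made up for illustration.
nodes = {
    "Alice": {"age": 30},
    "Bob": {"age": 25},
    "Carol": {"age": 41},
}

# Each edge involves exactly two nodes; a frozenset makes it undirected.
# A one-element frozenset is a reflexive relationship (a self-loop).
edges = {
    frozenset(["Alice", "Bob"]): {"weight": 2},
    frozenset(["Bob", "Carol"]): {"weight": 1},
    frozenset(["Carol"]): {"weight": 1},
}

def related(a, b):
    """Any two entities either are in a relationship or are not."""
    return frozenset([a, b]) in edges

print(related("Alice", "Bob"))    # True
print(related("Alice", "Carol"))  # False
print(related("Carol", "Carol"))  # True
```

A relationship among three or more entities has no place in this structure, which is exactly the limitation that hypergraphs address.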
Let’s have a look at some really basic, so-called “classic,” networks.

In a checkerboard, each field is an entity (node) with three attributes: “color”
(“black” or “white”), “column” (“A” through “H”), and “row” (1 through 8).
“Being next to” is the relationship between two entities. There is an edge
connecting two nodes if the nodes “are next to” each other. As a matter
of fact, “being next to” is one of the foundational relationships that leads
to spatial networks. You can see a “checkerboard” network, also known
as a mesh or grid, in the following figure.



In a timeline of our life, each life event (such as “birth,” “high school graduation,”
“marriage,” and eventually “death”) is an entity with at least one attribute:
“time.” “Happening immediately after” is the relationship: an edge connects
two events if one event occurs immediately after the other, leading to a
network of events. Unlike “being next to,” “happening immediately after”
is not symmetric: if A happened immediately after B (there is an edge from
A to B), then B did not happen after A (there is no reverse edge).

In a family tree, each person in the tree is an entity, and the relationship could
be either being “a descendant of” or “an ancestor of” (asymmetric). A
family tree network is neither spatial nor strictly temporal: the nodes are
not intrinsically arranged in space or time.




In a hierarchical system that consists of parts, sub-parts, and sub-sub-parts
(such as this book), a part at any level of the hierarchy is an entity. The
relationship between the entities is “a part of”: a paragraph is “a part of”
a subsection, which is “a part of” a section, which is “a part of” a chapter,
which is “a part of” a book.
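
Two of these classic networks are easy to write down in plain Python. The sketch below rests on stated assumptions: field A1 is taken to be black, only horizontal and vertical neighbors count as “being next to,” and the four dated events are a small illustrative sample from Abraham Lincoln’s life rather than the contents of any figure in this book. The variable names are hypothetical.

```python
from itertools import product

COLUMNS = "ABCDEFGH"
ROWS = range(1, 9)

# One node per field, decorated with the three attributes from the text.
fields = {}
for col, row in product(COLUMNS, ROWS):
    # Assumption: field A1 is black, and colors alternate from there.
    dark = (COLUMNS.index(col) + row) % 2 == 1
    fields[(col, row)] = {"color": "black" if dark else "white",
                          "column": col, "row": row}

# "Being next to" is symmetric, so each edge is an unordered pair.
# Assumption: diagonal neighbors do not count.
next_to = set()
for col, row in fields:
    c = COLUMNS.index(col)
    if c + 1 < len(COLUMNS):
        next_to.add(frozenset([(col, row), (COLUMNS[c + 1], row)]))
    if row + 1 in ROWS:
        next_to.add(frozenset([(col, row), (col, row + 1)]))

print(len(fields), len(next_to))  # 64 112

# "Happening immediately after" is directed: (later event, earlier event).
events = [("death", 1865), ("birth", 1809),
          ("presidency", 1861), ("marriage", 1842)]
ordered = sorted(events, key=lambda event: event[1])
happened_after = [(later[0], earlier[0])
                  for earlier, later in zip(ordered, ordered[1:])]
print(happened_after)
# [('marriage', 'birth'), ('presidency', 'marriage'), ('death', 'presidency')]
```

The checkerboard edges are stored as sets because the relationship is symmetric, while the timeline edges are ordered pairs because it is not.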

All the networks listed previously are simple because they have a regular or
almost regular structure. A checkerboard is a rectangular grid. A timeline is a
linear network. A family tree is a tree, and such is a network of a hierarchical
system (a special case of a tree with just one level of branches is called a star).
The following figure shows more simple networks: a linear timeline of Abraham
Lincoln (A.L.), his family tree, and a ring of months in a year. (A ring is another
simple network, which is essentially a linear network, wrapped around.)



Make no mistake: a simple network is simple not because it is small, but
because it is regular. For example, any ring node always has two neighbors;
any tree node (except for the root) has exactly one antecedent; any inner grid
node has exactly four neighbors, two of which are in the same row and the
other two in the same column. The complete world timeline has billions of
events. The humankind “family tree” has billions of individuals. We still
consider these networks simple.
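
The regularity claim is easy to check by counting neighbors. The following is a minimal sketch in plain Python; the helper degrees() and the network sizes are illustrative assumptions, not part of any library.

```python
from collections import Counter

def degrees(edges):
    """Count how many edges touch each node."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts

# A ring of months: a linear network, wrapped around.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
ring = [(months[i], months[(i + 1) % 12]) for i in range(12)]
ring_degrees = degrees(ring)
print(set(ring_degrees.values()))  # {2}

# A much larger ring is just as regular.
big_ring = [(i, (i + 1) % 10000) for i in range(10000)]
print(set(degrees(big_ring).values()))  # {2}

# A 5 x 5 grid: horizontal edges first, then vertical edges.
n = 5
grid = [((r, c), (r, c + 1)) for r in range(n) for c in range(n - 1)]
grid += [((r, c), (r + 1, c)) for r in range(n - 1) for c in range(n)]
grid_degrees = degrees(grid)
inner = [grid_degrees[(r, c)]
         for r in range(1, n - 1) for c in range(1, n - 1)]
print(set(inner))  # {4}
```

Growing the ring from twelve nodes to ten thousand changes nothing about its structure, which is the sense in which size and simplicity are unrelated.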

What is a complex network, then? 
A complex network has a non-trivial structure. It is not a grid, not a tree, not a
ring, but it is not entirely random, either. Complex networks emerge in nature
and the man-made world as a result of decentralized processes with no global
control. One of the most common mechanisms is the preferential attachment
(Emergence of Scaling in Random Networks [BA99]), whereby nodes with more
edges get even more edges, forming gigantic hubs in the core, surrounded by
the poorly connected periphery. Another evolutionary mechanism is transitive
closure, which connects two nodes together if they are already connected to a
common neighbor, leading to densely interconnected network neighborhoods.
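
Preferential attachment can be simulated in a few lines of plain Python. The sketch below is a simplified illustration of the growth rule, not the exact procedure of [BA99]; the function name, its parameters, and the choice of a small fully connected starting core are all assumptions.

```python
import random
from collections import Counter

def preferential_attachment(n, m, seed=None):
    """Grow a network of n nodes; each newcomer attaches to m existing nodes."""
    rng = random.Random(seed)
    # Assumption: start from a small, fully connected core of m + 1 nodes.
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # Each node appears in this list once per edge end, so a uniform draw
    # from it picks a node with probability proportional to its degree.
    ends = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for target in sorted(targets):
            edges.append((new, target))
            ends.extend((new, target))
    return edges

edges = preferential_attachment(n=200, m=2, seed=42)
degree = Counter(node for edge in edges for node in edge)
print(len(edges), "edges")
print("largest hubs:", degree.most_common(3))
print("median degree:", sorted(degree.values())[len(degree) // 2])
```

Comparing the largest hubs with the median degree shows the “rich get richer” effect: a few early nodes collect many edges, while most nodes stay close to the minimum of m.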

Let’s glance at some complex networks. The following table shows the major 
classes of complex networks and some representatives from each class. 


Technological networks          Communication systems; transportation; the
                                Internet; electric grid; water mains
Biological/ecological networks  Food webs; gene/protein interactions; neural
                                system; disease epidemics
Economic networks               Financial transactions; corporate partnerships;
                                international trade; market basket analysis
Social networks                 Families and friends; email/SMS exchanges;
                                professional groups
Cultural networks               Language families; semantic networks; literature,
                                art, history, religion networks (emerging fields)


The networks in the table pertain to diverse physical, social, and informational
aspects of human life. They consist of various nodes and edges, some material
and some purely abstract. However, all of them have common properties and
behaviors that can be found in complex networks and only in complex
networks, such as community structure, evolution by preferential attachment,
and power law degree distribution.


Enter Complex Network Analysis 

Complex network analysis (CNA), which is the study of complex networks
(their structure, properties, and dynamics), is a relatively new discipline, but
with a rich history.

You can think of CNA as a generalization of social network analysis (SNA) to
include non-social networks.


Social networks (descriptors of social structures through interactions) have
been known as “social groups” since the late 1890s. Their systematic
exploration began in the 1930s. In 1934, J.L. Moreno (Who Shall Survive? [Mor34])
developed sociograms: graph drawings of social networks. Eventually,
sociograms became the de facto standard of complex network visualization.

John Barnes coined the term “SNA” in 1954 (Class and Committees in a
Norwegian Island Parish [Bar54]). Around the same time, rapid penetration
of mathematical methods into social sciences began, leading to the emergence
of SNA as one of the leading paradigms in contemporary sociology.

Social network analysis addresses social networks at three levels: microscopic,
mesoscopic, and macroscopic. At the microscopic level, we view a network as
an assembly of individual nodes, dyads (pairs of connected nodes; essentially,
edges), triads (triples of nodes, connected in a triangular way), and subsets
(tightly knit groups of nodes). A mesoscopic view focuses on exponential random
graph models (ERGMs), scale-free and small-world networks, and network
evolution. Finally, at the macroscopic level, the more general complex network
analysis fully absorbs SNA, abstracting from the social origins of social networks
and concentrating on the properties of very large real-world graphs, such as
degree distribution, assortativity, and hierarchical structure (Exploring Complex
Networks [Str01]). You will see the definitions and explanations of some of
these properties and the Python ways of calculating them later in the book.
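
As a small foretaste of the microscopic view, dyads and triads can be listed with nothing but the standard library. The four-edge network below is made up for illustration, and the brute-force enumeration over all node triples is only suitable for tiny networks.

```python
from itertools import combinations

# A made-up network of four nodes and four edges.
edges = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")}

# Store each edge in both directions for a quick symmetric lookup.
linked = edges | {(v, u) for u, v in edges}
nodes = sorted({node for edge in edges for node in edge})

# Dyads are pairs of connected nodes: essentially, the edges themselves.
dyads = sorted(edges)

# Triads are triples of nodes in which every pair is connected.
triads = [triple for triple in combinations(nodes, 3)
          if all(pair in linked for pair in combinations(triple, 2))]

# The degree of a node is the number of its neighbors.
degree = {node: sum(1 for u, _ in linked if u == node) for node in nodes}

print(len(dyads))  # 4
print(triads)      # [('A', 'B', 'C')]
print(degree)      # {'A': 2, 'B': 2, 'C': 3, 'D': 1}
```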

But first, let’s get your hands dirty (possibly physically dirty) and sketch a 
real complex network on a sheet of paper. 

Draw Your First Network with Paper and Pencil

Just like networks with regular topology, complex networks are not necessarily
large. In fact, they are not even “complex” in the colloquial meaning of the
word. We can easily spot them without any specialized hardware or software;
a pair of inquisitive eyes, a sheet of paper, and a pencil often suffice.

As a proof of concept, let’s do an exercise in network construction (just
construction, no analysis so far!). We are deeply convinced that complex networks
are everywhere; rephrasing the quote, incorrectly attributed to Michelangelo,
“all we have to do is to chip away everything that is not a complex network.”

All people on Earth, including current and prospective complex network
analysts, deserve healthy nutrition. To help them build a balanced diet in
an utterly networked way, you will use a list of foods that provide naturally
occurring nutrients.¹ The data on the website is somewhat contradictory,
as is often the case with real-world data. For example, in one list item, the
authors refer to “shellfish,” and in another, to “seafood.” It is not clear if
freshwater crayfish is meant to be “seafood” or not, but let us not worry about
the strict biological taxonomy and make reasonable assumptions, whenever
necessary.

1. The document was originally found at
www.sharecare.com/health/nutrition-diet/which-foods-naturally-occurring-nutrients
but does not seem to be there anymore; it is cached as nutrients.txt at
pragprog.com/book/dzcnapy.

Your first step is to identify discrete entities. The dataset has two potential
candidates for entities (and, therefore, network nodes): foods (such as fish
and eggs) and nutrients (such as vitamins A and C). You could construct a
network of foods or a network of nutrients. However, you can shoot two birds
with one stone and create a network of both nutrients and foods (a so-called
bipartite network; more on them in Chapter 15, Harnessing Bipartite
Networks, on page 175). The nodes will be of two types, but don’t worry about this
heterogeneity now.

The relationship between digestive items is described by the verb “provides”
or “is provided”: certain food X provides nutrients Y1, Y2, and so on, and
certain nutrient Y is provided by certain foods X1, X2, and so on.

Now, take a sheet of paper and a pencil and transcribe the list of food and
nutrient items into a network, as follows:

1. Choose the first nutrient from the list; say, it is vitamin D. Draw a circle
that represents vitamin D and label it “D.”

2. Vitamin D is provided by fatty fish; draw a circle that represents fatty
fish, label it “fatty fish,” and connect it to the “D” node.

3. Vitamin D is also provided by mushrooms; draw a circle that represents
mushrooms, label it “mushrooms,” and connect it to the “D” node.

4. Repeat the previous steps for each combination of food types and
nutrients. Do not duplicate nodes! If a nutrient is provided by the food type
that already has a node, connect the nutrient to the existing node.
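
The four steps above translate almost word for word into plain Python. In this minimal sketch, only the vitamin D row comes from the steps; the B12 row is a made-up placeholder that is not taken from nutrients.txt, and the names network, node_type, and connect() are hypothetical.

```python
# (nutrient, foods) pairs. Only the vitamin D row comes from the steps
# above; the B12 row is a placeholder invented for illustration.
provided_by = [
    ("D", ["fatty fish", "mushrooms"]),
    ("B12", ["fatty fish", "eggs"]),
]

network = {}    # node -> set of neighboring nodes
node_type = {}  # node -> "nutrient" or "food"

def connect(nutrient, food):
    """Add both nodes if they are new, then join them with an edge."""
    node_type.setdefault(nutrient, "nutrient")
    node_type.setdefault(food, "food")
    network.setdefault(nutrient, set()).add(food)
    network.setdefault(food, set()).add(nutrient)

for nutrient, foods in provided_by:
    for food in foods:
        connect(nutrient, food)

# "fatty fish" was drawn once and then reused by the second nutrient.
print(sorted(network["fatty fish"]))  # ['B12', 'D']
print(len(network))                   # 5
```

Using setdefault() is what enforces the “do not duplicate nodes” rule: a node that already exists is reused, and only the new edge is added.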

The method of starting with a “seed” node and following the edges to discover
other nodes is called snowball sampling (“snowballing”). Your network starts
as a single snowflake and grows over time until either you are happy with its
size or there is no more “snow” to add. Beware: snowballing may overlook
small and medium-size network chunks if you choose an improper seed. To
mitigate potential problems in networks that consist of several disjointed
parts (so-called unconnected graphs), it might be best to select several seeds
and follow all edges originating from them.
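
Snowballing is essentially a breadth-first traversal that starts from the seeds. The sketch below assumes that the whole network is already available as a dictionary of neighbors; the function name snowball() and the nodes “X” and “Y” are hypothetical and stand for a small chunk that is disconnected from the rest.

```python
from collections import deque

def snowball(neighbors, seeds):
    """Collect every node reachable from the seeds by following edges."""
    found = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for other in neighbors.get(node, ()):
            if other not in found:
                found.add(other)
                queue.append(other)
    return found

# Two disjointed parts: the vitamin D chunk and a separate pair X-Y.
neighbors = {
    "D": {"fatty fish", "mushrooms"},
    "fatty fish": {"D"},
    "mushrooms": {"D"},
    "X": {"Y"},
    "Y": {"X"},
}

print(sorted(snowball(neighbors, ["D"])))
# ['D', 'fatty fish', 'mushrooms']
print(sorted(snowball(neighbors, ["D", "X"])))
# ['D', 'X', 'Y', 'fatty fish', 'mushrooms']
```

With the single seed “D,” the X-Y chunk is never discovered; adding a second seed from that chunk recovers it.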
By the way, congratulations! You just created your first complex network!
(Apologies if it was not your first.) Does it look like the following figure?



Is it error prone? Absolutely! 

Is the network drawing ugly? Most likely! 

But don’t worry. You will see how to automate the network construction
process soon (in Construct a Simple Network with NetworkX, on page 17 and
Import and Modify a Simple Network with Gephi, on page 32). In this chapter,
you learned the simple theory of complex networks and quick-and-dirty paper
and pencil network construction tricks. Let’s proceed to the overview of Python
and non-Python power tools for network construction and analysis.
Part I

Elementary Networks and Tools 


Even rudimentary automation of complex network
analysis leads to significant performance improvement.
The results are even more impressive when
you deal with many similar networks that have to
be analyzed in a similar way. In this part, you will
acquire elementary CNA automation skills.




M. Worsaae of Copenhagen, who has been followed by other
antiquaries, has even gone so far as to divide the natural history of
civilization into three epochs, according to the character of the tools
used in each.

Samuel Smiles, Scottish author and government reformer

CHAPTER 2 

Surveying the Tools of the Craft

The most common Python tools for manipulating and processing networks
are NetworkX, iGraph, graph-tool, and networkit. The modules make it possible to
construct complex networks from non-network data, analyze and visualize
the networks, and convert the analysis results into non-network data
structures (such as dictionaries, lists, and Pandas DataFrames); in other words, embed
CNA into the general-purpose software development workflow. In this chapter,
you will look at the four modules and compare their strong and weak points.
You will be able to decide if NetworkX, the module that we use in the rest of the
book, is right for your problem. (If not, you can still read the coding-agnostic
part of the book!) You will be ready to tackle NetworkX and write code to
construct “simple” complex networks.

Do Not Weave Your Own Networks 

As a Python programmer, you may often be tempted to disregard existing CNA
modules (especially since they’re not part of the language core) and produce
roll-your-own CNA code. For example, you could represent a network as a
list of edges (an edge list) or as a dictionary with nodes and edges. You could
spend a fortune designing an efficient data structure for the internal network
representation. But then the real job begins: implementing dozens of network
construction, serialization, deserialization, and analysis algorithms, followed
by aesthetically appealing, presentation-quality network visualization.
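
To get a feel for how quickly the chores pile up, here is a minimal roll-your-own sketch in plain Python. The function names and the one-edge-per-line text format are assumptions made for this illustration (node names must not contain whitespace); none of this is how the existing libraries store networks.

```python
from collections import defaultdict

# Representation 1: an edge list.
edge_list = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

# Representation 2: a dictionary that maps each node to its neighbors.
def to_adjacency(edges):
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)

# Serialization and deserialization: one "u v" pair per line.
def dump(edges):
    return "\n".join(" ".join(edge) for edge in edges)

def load(text):
    return [tuple(line.split()) for line in text.splitlines() if line.strip()]

adjacency = to_adjacency(edge_list)
print(sorted(adjacency["c"]))              # ['a', 'b', 'd']
print(load(dump(edge_list)) == edge_list)  # True
```

This covers two representations and one file format, with no attributes, no directed edges, no analysis, and no drawing.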

Even if you’re the right person for this task, as a complex network analyst, you
want the tools of the craft now, not in weeks or months. You want tools that
are bug-free, efficient, and well-documented. You want tools with a broad user
base from whom you can seek support and consolation. That is why I encourage
you to give up your ambitious plans of building your own CNA suite (if you ever
had such plans, of course) and consider one of the existing libraries.
Where to Get Help

Speaking of support and consolation, the primary source of it for a desperate
programmer is the Q&A site StackOverflow.ᵃ The number of individual tags on it attests to the
popularity of a library. At the time this book was written, iGraph had 1,940 posts (but
only 411 for the Python version); NetworkX was not too far behind with 1,711 posts. The
two libraries that support distributed processing via OpenMP (graph-tool and NetworKit)
were trailing with 151 and 21 posts. The latter one does not even have a specific tag!
Looks like people who work with huge networks know their way around.


a. stackoverflow.com 



If you’re impatient to see some Python code, Appendix 1, Network Construction,
Five Ways, on page 209 shows how to construct the same network (the Lincoln
graph on page 4) in pure Python and the four packages that you will come
to know next.

Glance at iGraph 

The library iGraph (properly spelled as iGraph, but imported as igraph) is an open
source and free “collection of network analysis tools with the emphasis on
efficiency, portability, and ease of use”.¹ NetworkX (the tool of choice in the
book) and iGraph are structurally similar, yet have their unique features and
algorithms.

One of the most notable features of iGraph is its availability as a C language
library with the bindings in C, Python, and R. The R application programming
interface (API) makes the package a better alternative for the network analysts
who have been trained as R programmers. The choice of C as the
implementation language also makes the package two orders of magnitude faster than
the comparable Python-only packages.

Let’s go over the list of iGraph’s other niceties. To begin with, the module
provides a convenient way to instantiate a network from an edge list, something
not critical for most projects, but still nice icing on the cake.

iGraph natively supports node clustering (community detection), which is
essential for complex network analysis. While community detection is available
for NetworkX through a third-party module (see Outline Modularity-Based
Communities, on page 136), you may feel more comfortable when it’s an integral
part of your tool chest.


1. igraph.org 
iGraph has a smart built-in search mechanism. A programmer can call methods
.select() and .find() with complex queries as parameters to locate nodes and
edges, based on their attribute values.

The iGraph drawing subsystem supports a variety of graph layout algorithms that
broadly expand presentational opportunities. Once again, NetworkX is capable of
similarly complex and perhaps even better charting, but it relies on graphviz
(Harness Graphviz, on page 28) or similar external programs for node placement.
You must install the programs separately, and their versions must match the
version of NetworkX. iGraph spares you from the version-matching misery.

Last but not least, iGraph is 10-50 times faster than NetworkX. We already
mentioned this fact before, but it is worth mentioning again. The success or failure
of your large CNA project may depend on whether you can finish the analysis
by the deadline (if at all). If you have a network of more than a hundred thousand
nodes, NetworkX may not be your best friend. (Hint: you can still try NetworKit!)

The benefits of iGraph far outweigh its flaws, but it is not flawless. First and
foremost, installing the package requires a C compiler and takes considerable time.

Another downside of iGraph is the way it handles nodes (iGraph developers refer 
to nodes as “vertices”) and edges. You can add an edge to a network only if 
both edge ends have already been added, which is not always desirable. 
Intemally, iGraph Stores edges in an indexed list. Any addition or removal of 
an edge triggers a costly reindexing; for instance, adding a hundred edges 
one at a time takes roughly a hundred times longer than adding the same 
edges from a list. What is worse, the removal of edges and nodes changes 
their indexes. If network elements are removed in a loop, node and edge 
indexes may be invalidated by prior removals. 
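The index-shifting pitfall can be reproduced without iGraph at all. The following minimal sketch uses a plain Python list as a stand-in for an index-addressed edge store (an assumption made purely for illustration; it does not call the iGraph API) and compares a naive removal loop with one that removes the largest indexes first.

```python
# A plain list stands in for an index-addressed edge store (illustrative only).
edges = ["e0", "e1", "e2", "e3"]
doomed = [1, 2]  # indexes of the edges we intend to remove: "e1" and "e2"

# Naive loop: each deletion shifts the indexes of all later elements,
# so the second deletion hits "e3" instead of "e2".
naive = list(edges)
for index in doomed:
    del naive[index]

# Safer loop: delete from the largest index down, so that the indexes
# still to be processed are never shifted by an earlier deletion.
safe = list(edges)
for index in sorted(doomed, reverse=True):
    del safe[index]

print(naive)  # ['e0', 'e2']
print(safe)   # ['e0', 'e3']
```

The same precaution (or collecting the doomed elements first and removing them in one call) is a reasonable habit with any library that addresses nodes and edges by position.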

Appreciate the Power of graph-tool 

graph-tool developers position the module as having “a level of performance
that is comparable (both in memory usage and computation time) to that of
a pure C/C++ library.”² Just like in the case of iGraph, the performance boost
comes from implementing the whole module in C/C++.

Once successfully installed, graph-tool shines. For starters, it is based on the
OpenMP protocol that supports shared memory multiprocessing programming.³
A graph-tool program is capable of using all CPUs and cores available
to your system. Many CNA tasks (such as PageRank and betweenness


2. graph-tool.skewed.de 

3. www.openmp.org/ 
calculation) are easily parallelizable: they can be split into N subtasks, so that
each one is executed by a CPU or core, reducing the total running time by
a factor of up to N. (Note that even without OpenMP, graph-tool is still the
fastest Python CNA library of the four libraries considered.) If you feel your
network is too large and CNA jobs take too much time to run, adding another
CPU may help.

Here’s a list of some of graph-tool’s most prominent features:

• Excellent support for drawing. graph-tool supports a variety of layouts and
output formats. It can use its built-in layout and visualization engines or
rely on the external graphviz package (you will meet it in Harness Graphviz,
on page 28).

• Extended graph statistics calculation tools that spare you from relying
on other statistical modules.

• Built-in community detection and improved blockmodeling (see Perform
Blockmodeling, on page 138) algorithms. Combined with the OpenMP
parallelization, this feature makes graph-tool a really serious community
detection engine.

• Graph filtering and graph views. Slicing (Slice Weighted Networks, on page
79) is a common task in complex network construction, whereby a decision
to keep or discard an edge is made based on its weight. graph-tool allows
you to imitate a sliced network graph without really removing the
unworthy edges, but by temporarily hiding (“filtering”) them, based on
their attributes and other properties. The new virtual graphs can be saved
as “views” and later analyzed and visualized as if they were genuine
networks of their own.
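The idea of a filtered view is not unique to graph-tool. The sketch below illustrates it with NetworkX rather than graph-tool, assuming a NetworkX release (2.x or later) that provides nx.subgraph_view(); the edge weights and the cutoff THRESHOLD are made up for the example.

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 0.9), ("b", "c", 0.2), ("c", "d", 0.7)])

THRESHOLD = 0.5  # hypothetical cutoff chosen for this example


def is_strong(u, v):
    """Keep an edge only if its weight reaches the threshold."""
    return G[u][v]["weight"] >= THRESHOLD


# The view hides the weak edges but shares its data with the original graph.
strong_view = nx.subgraph_view(G, filter_edge=is_strong)

visible = {frozenset(edge) for edge in strong_view.edges()}
print(len(visible))         # 2 edges pass the filter
print(G.number_of_edges())  # 3: the original graph is left intact
```

Because nothing is copied or deleted, changing the threshold only requires building another view.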

graph-tool’s superb performance comes at the cost of increased time and
memory required during installation and compilation. My experience shows 
that installing graph-tool is not even always feasible. If you happen to run Linux 
and you have not updated it for a while (because “if it ain’t broke, don’t fix 
it”), then chances are you will have to compile the module from the C/C++ 
source code. Some Python programmers hate challenges like that. 

graph-tool’s other major deficiency is in how it handles nodes (graph-tool calls
nodes “vertices”). Unlike other network analysis libraries, graph-tool nodes do
not have names. Instead, they have contiguous indexes. (Presumably because 
graph-tool makes little effort to disguise its C/C++ heritage!) Adding a node to 
a graph does not affect existing indexes. However, removing a node (unless
it is the node with the largest index) invalidates some or all existing indexes
and may cause mysterious errors, especially in loops.

Another inconvenience of graph-tool is related to the separation of nodes and
edges and their attributes, including node names/labels. The attributes are
stored in dictionary-style data structures called property maps. The node
name is just one of the attributes. The programmer is responsible for keeping
the node list and the attribute lists in a consistent state. This second-class
treatment of node labels makes graph-tool more suitable for general graph
analysis, rather than for the analysis of real world-inspired complex networks.

To summarize: graph-tool is a terrific module, though not without issues. 
Someone should write a book about it, too. 

Accept NetworkX 

NetworkX is indeed the tool of the craft, at least for this book, and I would like
to justify my choice.

The main winning points of NetworkX are fourfold: 

• NetworkX is painless to install. Since it is written in pure Python, it requires
no compilation.

• NetworkX has excellent online documentation, far superior to that of iGraph
and graph-tool. Besides, it has an active community of supporters on
StackOverflow.⁴

• NetworkX’s structure (functions, algorithms, and attributes) is in good
agreement with CNA tasks.

• NetworkX’s performance is acceptable up to about 100,000 nodes. 

While it is true that NetworkX lacks some essential features (such as community
detection and advanced visualization layouts), you can easily add these
features by installing Python-only third-party modules. We will use some of these
modules in this book.

Keep in Mind NetworKit 

NetworKit is another great library that supports parallelized network
processing. NetworkX and NetworKit are compatible at the graph level. If you are in a
rush, construct a network in NetworkX and convert it to NetworKit for further
analysis.


4. stackoverflow.com/questions/tagged/networkx 
Compare the Toolkits

The following table contains a side-by-side comparison of the toolkits
mentioned in the previous sections. The relative slowdown value shows how much
slower the tool is compared to the fastest tool in the collection (which,
incidentally, is graph-tool).



Feature                       graph-tool  iGraph        NetworkX  NetworKit
Implementation language       C/C++       C/C++         Python    C/C++
Language bindings             Python      C, Python, R  Python    C++, Python
Installation effort           Hard        Medium        Easy      Medium
OpenMP support                Yes         No            No        Yes
Relative slowdown⁵            1           1-4           40-135    N/A
Built-in community detection  Yes         Yes           No        Yes
Built-in advanced layouts     Yes         Yes           No        Yes






In this chapter, we compared four of the most popular CNA tools written in
Python and available for free. You’ve got to admit that NetworkX does not
necessarily look like the best tool. However, it is the most easily installable, the
most robust, and the most well documented. It is still a noble and venerable
toolset, and we will stick with it. Turn to the next chapter, and you will find
out how to create your first NetworkX-based complex network.


5. graph-tool.skewed.de/performance 
The result is a most extraordinary looking creature, a network of
worms with numerous heads, each branch being eventually provided
with one of its own.

B. Lindsay, American biologist and writer 


CHAPTER 3

Introducing NetworkX 

Any network starts with one node, and we can add more nodes and edges to 
it, as needed. The attributes of those nodes and edges describe their properties.
The node, edge, and attribute data come from other data structures or files. 

In this chapter, you will learn NetworkX functions for starting a new network,
populating it with nodes and edges, and decorating them with attributes. You
will also learn how to create a “quick and dirty” visualization of the constructed
network (we will give you more powerful visualization tools later in Chapter
4, Introducing Gephi, on page 31).

In many cases, complex network analysis is an iterative process, whereby the
network grows, shrinks, or undergoes other transformations over time. You
will learn how to preserve a complex network as a disk file in a variety of
popular formats (some of which you can later import into Gephi, an interactive
network analysis tool) and how to read data from appropriate files into the
NetworkX representation.

We will use the following terminology throughout the book to refer to the 
relationships between nodes and edges: 

• A node is incident to an edge if it is the start or end of the edge. The edge, 
respectively, is incident to its end nodes. 

• Two nodes are adjacent if they are incident to the same edge.

• Two edges are adjacent if they are incident to the same node. 

Construct a Simple Network with NetworkX

A NetworkX project begins with importing module networkx (usually under the 
name nx). 

import networkx as nx 
Create a Graph

A NetworkX network is a collection of edges and labeled nodes. The library
allows you to use any hashable Python data as a node label (different labels
within the same graph may belong to different data types). To create a new
network graph, you must choose an appropriate graph type and call the
respective constructor; pass either no parameters (for an empty graph) or a
list of edges as node pairs (lists or tuples). NetworkX supports four graph types:

• Undirected graphs consist only of undirected edges: edges that can be
traversed in either direction so that an edge from A to B is the same as
an edge from B to A. Mathematically, undirected graphs represent symmetric
relationships: If A is in a relationship with B, then B is also in a
relationship with A. For example, sistership and companionship are
symmetric relationships, but “being in love with” is not (at least, not
always). Create an empty undirected graph with the constructor nx.Graph():

G = nx.Graph()

Undirected graphs can have self-loops—edges that start and end at the 
same node. Mathematically, self-loops represent a reflexive relationship: 
A is in a relationship with itself. If an undirected graph does not have 
self-loops, it is called a simple graph. A graph that is not simple is called 
a pseudograph. 

• Directed graphs, also known as digraphs, have at least one directed edge.
“Being the father of” is an asymmetric relationship and would be represented
by a directed edge. You would use a directed graph for a family network
that shows fathership and mothership. Create an empty directed graph
with the constructor nx.DiGraph():

G = nx.DiGraph() 

Many NetworkX algorithms refuse to calculate with digraphs. You can
convert a digraph into an undirected graph. All directed edges become
undirected, and all pairs of two reciprocal edges become single edges. However,
remember that the original digraph and the derived undirected graph are
different.

F = nx.Graph(G) # F is undirected 

• Multigraphs are like undirected graphs, but they can have parallel edges
(multiple edges between the same nodes). Parallel edges may represent
different types of relationships between the nodes. For example, Alice may
be a classmate of Bob, but she also may be his friend. Create an empty
multigraph with the constructor nx.MultiGraph():

G = nx.MultiGraph()

• Finally, directed multigraphs are what they say they are: directed graphs
with parallel edges. Create an empty directed multigraph with the
constructor nx.MultiDiGraph():

G = nx.MultiDiGraph()

Chapter 17, Directed Networks, on page 197 in this book is dedicated to 
directed networks. Until we get there, unless said otherwise, let’s assume 
that all our networks are undirected and don’t have parallel edges, but possibly 
have self-loops. 
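To make the differences between the graph types tangible, here is a small sketch. The node names are arbitrary; the only claim it illustrates is how the constructors treat reciprocal and parallel edges.

```python
import networkx as nx

# A digraph with one reciprocal pair (A <-> B) and one one-way edge (B -> C)
D = nx.DiGraph([("A", "B"), ("B", "A"), ("B", "C")])

# Converting to an undirected graph merges the reciprocal pair into one edge
U = nx.Graph(D)

# A multigraph keeps parallel edges that a plain graph would merge
M = nx.MultiGraph([("Alice", "Bob"), ("Alice", "Bob")])
S = nx.Graph([("Alice", "Bob"), ("Alice", "Bob")])

print(D.number_of_edges(), U.number_of_edges())  # 3 2
print(M.number_of_edges(), S.number_of_edges())  # 2 1
```

The conversion from D to U loses information: there is no way to tell from U alone which edges used to be one-way.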

Add and Remove Nodes and Edges 

NetworkX provides several mechanisms for adding nodes and edges to an 
existing graph: one by one, from a list or another graph. Likewise, you can 
remove nodes or edges one by one or by using a list. Node and edge manipu- 
lations are subject to the following rules: 

• Adding an edge to a graph also ensures that its ends are added if they 
did not exist before. 

• Adding a duplicate node or edge is silently ignored unless the graph is a 
multigraph; in the latter case, an additional parallel edge is created. 

• Removing an edge does not remove its end nodes.

• Removing a node removes all incident edges.

• Removing a single non-existent node or edge raises a NetworkXError exception,
but if the node or edge is a part of a list, then an error is silently ignored.
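The rules above can be checked one by one on a throwaway graph. The following sketch is a minimal demonstration with made-up node names; the helper variables exist only to record the state of the graph after each step.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("liver", "Se")      # Both end nodes are created on the fly
G.add_edge("liver", "Se")      # Duplicate edge: silently ignored
G.add_node("liver")            # Duplicate node: silently ignored
nodes_after_adding = set(G.nodes())
edges_after_adding = G.number_of_edges()

G.remove_edge("liver", "Se")   # The end nodes stay in the graph
nodes_after_edge_removal = set(G.nodes())

G.add_edge("liver", "Se")
G.remove_node("Se")            # The incident edge goes away with the node
edges_after_node_removal = G.number_of_edges()

try:
    G.remove_node("Hg")        # A single missing node raises an exception
    raised = False
except nx.NetworkXError:
    raised = True

G.remove_nodes_from(["Hg"])    # The same node in a list: no exception

print(edges_after_adding, edges_after_node_removal, raised)  # 1 0 True
```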

Let’s use the data collected in Draw Your First Network with Paper and Pencil,
on page 6 to build the same network of foods and nutrients programmatically,
using all node addition techniques mentioned previously:

G = nx.Graph([("A", "eggs"),])
G.add_node("spinach") # Add a single node
G.add_node("Hg") # Add a single node by mistake
G.add_nodes_from(["folates", "asparagus", "liver"]) # Add a list of nodes
G.add_edge("spinach", "folates") # Add one edge, both ends exist
G.add_edge("spinach", "heating oil") # Add one edge by mistake
G.add_edge("liver", "Se") # Add one edge, one end does not exist
G.add_edges_from([("folates", "liver"), ("folates", "asparagus")])




We intentionally added several inedible nodes, just to illustrate how one can
remove unwanted fragments:

G.remove_node("Hg")
G.remove_nodes_from(["Hg",]) # Safe to remove a missing node using a list
G.remove_edge("spinach", "heating oil")
G.remove_edges_from([("spinach", "heating oil"),]) # See above
G.remove_node("heating oil") # Not removed yet

You can use the method G.clear() to delete all graph nodes and edges at once
but keep the graph shell.

Look at Edge and Node Lists

NetworkX provides several options for exploring the node and edge lists. Graph
object attributes (not to be confused with network attributes in Add Attributes,
on page 23) G.node and G.edge store all nodes and edges, respectively, in the
form of dictionaries. Node labels are the keys of G.node. Node attributes, in the
form of nested dictionaries, are the values. Since we did not assign any
attributes to the nodes yet, the dictionaries are empty.

print(G.node) 

< {'Se': {}, 'eggs': {}, 'asparagus': {}, 'A': {}, 'liver': {}, 'spinach': {},
'folates': {}}

Start node labels are the keys of G.edge, too. Each dictionary value corresponds
to one edge (also in the form of a dictionary), where the keys are end node
labels, and the values are edge attribute dictionaries.

print(G.edge) 

< {'Se': {'liver': {}}, 'eggs': {'A': {}}, 'asparagus': {'folates': {}},
'A': {'eggs': {}}, 'liver': {'Se': {}, 'folates': {}},
'spinach': {'folates': {}},
'folates': {'asparagus': {}, 'liver': {}, 'spinach': {}}}

The second option is to call the methods G.nodes() and G.edges(). (Mind the s at the
end!) If called without any parameters, the methods return node and edge lists.

print(G.nodes())
< ['Se', 'eggs', 'asparagus', 'A', 'liver', 'spinach', 'folates']
print(G.edges())
< [('Se', 'liver'), ('eggs', 'A'), ('asparagus', 'folates'),
('liver', 'folates'), ('spinach', 'folates')]

Note that NetworkX created the edge attribute accessors G.edge['Se']['liver'] and
G.edge['liver']['Se'], but the edge list contains only ('Se', 'liver'), without the
reverse counterpart ('liver', 'Se').
If called with the optional parameter data=True, the methods return the lists
with the additional attribute dictionaries.

print(G.nodes(data=True))
< [('Se', {}), ('eggs', {}), ('asparagus', {}), ('A', {}), ('liver', {}),
('spinach', {}), ('folates', {})]
print(G.edges(data=True))
< [('Se', 'liver', {}), ('eggs', 'A', {}), ('asparagus', 'folates', {}),
('liver', 'folates', {}), ('spinach', 'folates', {})]

You can measure the length of the returned lists or dictionaries to find out 
the number of nodes and edges. Additionally, function len(G) returns the
number of nodes in G. 
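The attribute names G.node and G.edge belong to the NetworkX release this text was written for. The following sketch assumes a more recent release (2.x or later), where G.nodes and G.adj play the same roles; treat it as an illustration of the lookups described above rather than as a replacement for the book's listings.

```python
import networkx as nx

G = nx.Graph([("A", "eggs"), ("liver", "Se"), ("folates", "liver")])

# Node label -> attribute dictionary
node_attrs = dict(G.nodes(data=True))

# Start node -> end node -> edge attribute dictionary
adjacency = {node: dict(nbrs) for node, nbrs in G.adj.items()}

# Each undirected edge is listed once, although it is reachable from both ends
edge_list = list(G.edges())

print(node_attrs["liver"])         # {} (no attributes yet)
print(sorted(adjacency["liver"]))  # ['Se', 'folates']
print(len(G), G.number_of_nodes(), G.number_of_edges())  # 5 5 3
```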

Read a Network from a CSV File 

The toy code fragment on page 19 has at least three problems. First, it is a 
toy. Second, it is incomplete. Third, it is not flexible: any change in the original
network requires that you rewrite the code. 

Ideally, you would record a network in a file (using some popular data format,
such as comma-separated values, or CSV). You would then write a program
that reads the network data from the file and constructs a Graph object. NetworkX
has an excellent collection of file readers and writers, but let’s pretend it does
not and implement a CSV edge list reader. Our nutrients and foods are in the
file nutrients.csv.¹ The first ten and the last five lines of the file are shown below:

A,carrots
A,eggs
A,"fatty fish"
A,"green leafy vegs"
A,liver
A,milk
A,tomatoes
B12,milk
B6,asparagus
B6,beans
«more pairs»
shellfish,Se
thiamin,"whole grains"
tomatoes,tomatoes
"veg oils",E
yogurt,Ca


1. pragprog.com/titles/dzcnapy/source_code 
Now, watch the magic of Python. It takes only three lines of code to open the
edge list file, create a CSV reader for the file, and “suck” the list of pairs into
the Graph constructor.

nutrients.py
import networkx as nx
import matplotlib.pyplot as plt
import dzcnapy_plotlib as dzcnapy
import csv

with open("nutrients.csv") as infile:
    csv_reader = csv.reader(infile)
    G = nx.Graph(csv_reader)
print(G.nodes())

< ['B6', 'wheat', 'nuts', 'beef', 'cheese', 'milk', 'B12', 'E', 'thiamin',
'liver', 'legumes', 'broccoli', 'C', 'folates', 'yogurt', 'tomatoes',
'veg oils', 'riboflavin', 'beans', 'mushrooms', 'D', 'spinach', 'shellfish',
'niacin', 'A', 'fatty fish', 'Se', 'Mn', 'green leafy vegs', 'poultry',
'pumpkins', 'Cu', 'whole grains', 'Zn', 'eggs', 'carrots', 'asparagus',
'potatoes', 'Ca', 'kidneys', 'seeds']
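If you do not have nutrients.csv at hand, you can try the same technique on an in-memory file. The sketch below is a self-contained variation: the SAMPLE string is a made-up five-line excerpt, and the explicit list of pairs (with blank rows skipped and stray spaces stripped) is a defensive choice of mine, not part of the original three-liner.

```python
import csv
import io

import networkx as nx

# A tiny in-memory stand-in for nutrients.csv (illustrative data only)
SAMPLE = '''A,carrots
A,eggs
A,"fatty fish"
tomatoes,tomatoes
yogurt,Ca
'''

with io.StringIO(SAMPLE) as infile:
    reader = csv.reader(infile)
    # Skip blank rows and strip stray spaces around each label
    pairs = [tuple(cell.strip() for cell in row) for row in reader if row]

G = nx.Graph(pairs)
print(G.number_of_nodes(), G.number_of_edges())  # 7 5
```

Note that the CSV reader takes care of the quoted labels: "fatty fish" becomes a single node label with a space inside.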

The provided edge list in the file nutrients.csv has an intentional inconsistency:
an edge that connects the node “tomatoes” with itself, a self-loop. You can
remove the self-loops by first identifying them with G.selfloop_edges() and then
passing the loop edges to G.remove_edges_from():

nutrients.py
loops = G.selfloop_edges()
G.remove_edges_from(loops)
print(loops)
< [('tomatoes', 'tomatoes')]
loops = G.selfloop_edges()
print(loops) # No more loops
< []
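In more recent NetworkX releases (2.x and later, which this sketch assumes), the self-loop lookup is a module-level function, nx.selfloop_edges(G), and it returns a lazy iterator rather than a list. The example uses a made-up three-edge graph and copies the loops into a list before removing them.

```python
import networkx as nx

G = nx.Graph([("A", "tomatoes"), ("tomatoes", "tomatoes"), ("C", "tomatoes")])

# Materialize the lazy iterator first; the graph must not change
# while the self-loops are still being enumerated.
loops = list(nx.selfloop_edges(G))
G.remove_edges_from(loops)

print(loops)                       # [('tomatoes', 'tomatoes')]
print(list(nx.selfloop_edges(G)))  # [] (no more loops)
```

Only the loop is gone; the node "tomatoes" and its two ordinary edges remain in the graph.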

Relabel Nodes 

The network looks magnificent, but there is one more thing we can do to make
it better: capitalize all node names. NetworkX provides the function nx.relabel_nodes()
that takes a graph and a dictionary of old and new labels and either creates
a relabeled copy of the graph (copy=True, default) or modifies the graph in place
(copy=False; use this option if the graph is large and you don’t plan to keep the
original graph). Each dictionary key must be an existing node label, but some
labels may be missing. The respective nodes will not be relabeled.
We will use dictionary comprehension to walk through all network nodes and
convert those labeled with strings to the title case (capitalize the first letter
of each word).

nutrients.py
mapping = {node: node.title() for node in G if isinstance(node, str)}
nx.relabel_nodes(G, mapping, copy=False)
print(G.nodes())
< ['B6', 'Yogurt', 'Legumes', 'Tomatoes', 'Potatoes', 'Wheat', 'Eggs',
'Veg Oils', 'D', 'Pumpkins', 'Poultry', 'Kidneys', 'Liver', 'Broccoli',
'Zn', 'Carrots', 'Whole Grains', 'Folates', 'Niacin', 'Nuts', 'Seeds',
'Mn', 'C', 'Mushrooms', 'Shellfish', 'B12', 'Cheese', 'Fatty Fish', 'E',
'Thiamin', 'Riboflavin', 'A', 'Green Leafy Vegs', 'Se', 'Beef',
'Asparagus', 'Milk', 'Cu', 'Spinach', 'Beans', 'Ca']

Note that G in the previous code fragment acts as a node iterator. In fact, G 
has some other dict() features. For example, you can use selection operator [] 
to access the edges incident to the node, and their attributes: 

print(G["Zn"])
< {'Liver': {}, 'Nuts': {}, 'Beef': {}, 'Beans': {}, 'Poultry': {},
'Potatoes': {}, 'Kidneys': {}}
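Here is a compact, self-contained variation of the relabeling step on a made-up three-edge graph. Unlike the listing above, it keeps the default copy=True, so that you can compare the original graph with the relabeled copy.

```python
import networkx as nx

G = nx.Graph([("A", "fatty fish"), ("Zn", "beef"), ("thiamin", "whole grains")])

# Graph G iterates over its node labels, just like a dictionary over its keys
mapping = {node: node.title() for node in G if isinstance(node, str)}
H = nx.relabel_nodes(G, mapping)  # copy=True by default: G stays untouched

print(sorted(H.nodes()))
# ['A', 'Beef', 'Fatty Fish', 'Thiamin', 'Whole Grains', 'Zn']
print(dict(H["Zn"]))              # {'Beef': {}}
```

Labels such as "A" and "Zn" are already in title case, so the mapping sends them to themselves and they pass through unchanged.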

But wait, the network does not have any attributes yet. You may want to add 
them now. 

Add Attributes 

A node or edge attribute describes its non-structural properties. For example,
edge attributes may represent weight, strength, or throughput. Node attributes
may represent age, color, size, or gender. NetworkX provides mechanisms for
setting, changing, and comparing attributes.

An attribute is implemented as a dictionary associated with the node or edge.
The dictionary keys are attribute names. As such, they must be immutable:
int(), float(), bool(), str(), and so on. There are no limitations on the values. You
can create a node whose attribute is the node itself, except that this exercise
is utterly pointless.

NetworkX offers three options for setting node and edge attributes. 

• Define attributes at the time of adding nodes or edges: 

G.add_node("Honey", edible=True)
G.add_nodes_from([("Steel", {"edible": False}),])
G.add_edge("Honey", "Steel", weight=0.0)
G.add_edges_from([("Honey", "Zn"),], related=False)

NetworkX and other CNA libraries consistently use the edge attribute "weight" to
denote edge strength (as in Distinguish Strong and Weak Ties, on page
66). A graph whose edges have this attribute is called a weighted graph.
There is a method G.add_weighted_edges_from() for adding weighted edges.

G.add_weighted_edges_from([("Honey", "Zn", 0.01),
                           ("Honey", "Sugar", 0.99)])

Note that when you add several edges with G.add_edges_from(), you can
specify only one set of attributes for all of them. If that’s not what you
want, use the next option.

• Define or change an attribute of existing nodes and edges by calling
nx.set_node_attributes() or nx.set_edge_attributes():

nx.set_node_attributes(G, att_name, node_dict)
nx.set_edge_attributes(G, att_name, edge_dict)

Here, att_name is the name of the affected attribute, node_dict/edge_dict is a
dictionary whose keys are existing node labels or edge pairs, and values are
attribute values for the respective nodes/edges. If the attribute doesn’t exist
yet, it’s created; otherwise, the value of the existing attribute is changed. If
a key isn’t a node label or edge pair, the methods raise a KeyError exception.

• Define or change an attribute of individual existing nodes and edges
directly through the dictionary interfaces G.node (indexed by node labels)
and G.edge (double indexed by start and end node labels):

G.node["Zn"]["nutrient"] = True # Zinc is a nutrient
G.edge["Zn"]["Beef"]["weight"] = 0.95 # Zinc and beef are well connected

The dictionary interface allows you to remove unwanted attributes:

del G.node["Zn"]["nutrient"]
del G.edge["Zn"]["Beef"]["weight"]

Regardless of how you create an attribute, it can be modified using any of
the three options listed previously.
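The three options can be combined in one short script. The sketch below assumes a more recent NetworkX release (2.x or later), where the dictionary interfaces are spelled G.nodes and G.edges; it also passes name and values to nx.set_node_attributes() as keyword arguments, because, as far as I know, their positional order differs between the call shown above and later releases. The attribute "color" is made up for the example.

```python
import networkx as nx

G = nx.Graph()
G.add_node("Honey", edible=True)  # Option 1: at creation time
G.add_weighted_edges_from([("Honey", "Zn", 0.01)])

# Option 2: bulk assignment; keyword arguments avoid relying on the
# positional order of the attribute name and the dictionary
nx.set_node_attributes(G, values={"Zn": True, "Honey": False}, name="nutrient")

# Option 3: direct dictionary access
G.nodes["Zn"]["color"] = "yellow"
G.edges["Honey", "Zn"]["weight"] = 0.05

del G.nodes["Honey"]["edible"]    # Remove an unwanted attribute

print(G.nodes["Honey"])            # {'nutrient': False}
print(G.nodes["Zn"])               # {'nutrient': True, 'color': 'yellow'}
print(G["Honey"]["Zn"]["weight"])  # 0.05
```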

In our little food and nutrition exercise, we have nodes of two types: foods
and nutrients. Labeling them for future analysis would be helpful. Let’s create
a boolean attribute "nutrient" that is true for nutrients and false for foods. The
information about node type was not in the original dataset.

nutrients.py
nutrients = set(("B12", "Zn", "D", "B6", "A", "Se", "Cu", "Folates",
                 "Ca", "Mn", "Thiamin", "Riboflavin", "C", "E", "Niacin"))
nutrient_dict = {node: (node in nutrients) for node in G}
nx.set_node_attributes(G, "nutrient", nutrient_dict)
Ready, Set(), Go

Python programmers often use collections of items solely for
lookup. They are interested in whether an item is in the collection
or not, not in where in the collection it is. Python lists have
notoriously slow (linear) lookup performance. Whenever possible,
convert them into sets that have constant lookup time.
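You can measure the difference yourself. The following sketch uses the standard timeit module and a made-up collection of 10,000 labels; the exact timings depend on your machine, so none are quoted here.

```python
import timeit

labels = [f"node{i}" for i in range(10_000)]  # a list of hypothetical labels
label_set = set(labels)                        # the same labels as a set

target = "node9999"                            # worst case for the list scan

# Both containers answer the membership question identically...
in_list = target in labels
in_set = target in label_set

# ...but the list compares items one by one, while the set hashes the key.
list_time = timeit.timeit(lambda: target in labels, number=200)
set_time = timeit.timeit(lambda: target in label_set, number=200)
print(f"list: {list_time:.5f} s, set: {set_time:.5f} s")
```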

All nodes have been successfully labeled: 

print(G.nodes(data=True))
< [('Seeds', {'nutrient': False}), ('B12', {'nutrient': True}),
('B6', {'nutrient': True}), ('Se', {'nutrient': True}),
('Cu', {'nutrient': True}), ('Asparagus', {'nutrient': False}),
('Broccoli', {'nutrient': False}), ('Poultry', {'nutrient': False}),
('Eggs', {'nutrient': False}), ('D', {'nutrient': True}),
('Ca', {'nutrient': True}), ('Whole Grains', {'nutrient': False}),
('Beef', {'nutrient': False}), ('Thiamin', {'nutrient': True}),
('Shellfish', {'nutrient': False}), ('Kidneys', {'nutrient': False}),
('Riboflavin', {'nutrient': True}), ('Spinach', {'nutrient': False}),
('Cheese', {'nutrient': False}), ('Beans', {'nutrient': False}),
('C', {'nutrient': True}), ('Veg Oils', {'nutrient': False}),
('Tomatoes', {'nutrient': False}), ('E', {'nutrient': True}),
('Mushrooms', {'nutrient': False}), ('Liver', {'nutrient': False}),
('Zn', {'nutrient': True}), ('Niacin', {'nutrient': True}),
('A', {'nutrient': True}), ('Folates', {'nutrient': True}),
('Legumes', {'nutrient': False}), ('Yogurt', {'nutrient': False}),
('Nuts', {'nutrient': False}), ('Mn', {'nutrient': True}),
('Milk', {'nutrient': False}), ('Wheat', {'nutrient': False}),
('Green Leafy Vegs', {'nutrient': False}),
('Pumpkins', {'nutrient': False}), ('Carrots', {'nutrient': False}),
('Potatoes', {'nutrient': False}), ('Fatty Fish', {'nutrient': False})]

We will get back to the node attributes in Estimate Network Uniformity Through
Assortativity, on page 97. But how does the network appear?

Visualize a Network with Matplotlib

NetworkX is not aesthetically the best library for network
visualization. In fact, it does not even do visualization on its
own but uses services rendered by Matplotlib, a multipurpose
graphics library. Luckily, the team of NetworkX and Matplotlib
is fast and easy to understand, and the interaction with Matplotlib is well hidden
from the network analyst; you do not need to learn yet another library, but
you still need to import it.

import matplotlib.pyplot as plt

This section uses Matplotlib.

Note that you typically do not need the whole library, but just the submodule
matplotlib.pyplot.

The process of network visualization consists of two phases: layout and
rendering. At the layout phase, the software selects geometric positions of each
node according to a layout algorithm.

NetworkX supports a variety of layout algorithms. You can choose one of them,
based on your aesthetic preferences and your network’s aesthetic propensity.
For each algorithm, NetworkX has a proper layout function that takes the graph
to plot and returns a dictionary of node positions (to be used at the rendering
phase), and an all-in-one function that does both layout and rendering. The
following table shows the most useful layout and drawing functions. The figure
on page 28 shows some actual layouts.


Layout          Arrange nodes...                        Layout function                        All-in-one function
Random          Randomly (this layout requires NumPy)   pos=nx.random_layout()                 nx.draw_random()
Circular        On a circle                             pos=nx.circular_layout()               nx.draw_circular()
Shell           On concentric circles, as defined       pos=nx.shell_layout(G,nlist=None)      nx.draw_shell()
                by nlist
Spectral        Based on their eigenvector centrality   pos=nx.spectral_layout()               nx.draw_spectral()
                values (see Eigenvector Centrality,
                on page 94)
Force-directed  As if they were physical balls that     pos=nx.fruchterman_reingold_layout()
                repel one another, connected with
                springs
Same as above   Same as above                                                                  nx.draw_networkx()
Same as above   Same as above                           pos=nx.spring_layout()                 nx.draw_spring()


At the rendering phase, NetworkX draws the nodes, labeis, and edges at the 
prescribed positions, using the default or specified shapes, fonts, and colors. 
You can see the graphical output on the screen, save it into a file (supported 
formats include PNG, PDF, PostScript, EPS, and SVG), or both. In the latter 
case, you must first save the image and only then display it. 


The following code fragment prepares a color sequence (pink vs. yellow, 
depending on the node type) for the nodes. 

nutrients.py
# Prepare for drawing
colors = ["yellow" if n[1]["nutrient"] else "pink" for n in
          G.nodes(data=True)]
dzcnapy.medium_attrs["node_color"] = colors

For each of the four layouts (specified by the subplot axes, layout method,
and human-readable title), we calculate the node positions and call
nx.draw_networkx(), the generic drawing method.

nutrients.py
# Draw four layouts in four subplots
_, plot = plt.subplots(2, 2)
subplots = plot.reshape(1, 4)[0]
layouts = (nx.random_layout, nx.circular_layout, nx.spring_layout,
           nx.spectral_layout)
titles = ("Random", "Circular", "Force-Directed", "Spectral")
for plot, layout, title in zip(subplots, layouts, titles):
    pos = layout(G)
    nx.draw_networkx(G, pos=pos, ax=plot, with_labels=False,
                     **dzcnapy.medium_attrs)
    plot.set_title(title)
    dzcnapy.set_extent(pos, plot)

NetworkX doesn’t take very good care of scaling network charts. You can do
better by manually calculating the extent of each layout and reserving enough
space. The last line of the loop calls the function dzcnapy_plotlib.set_extent(pos, plot) from
the module dzcnapy_plotlib.² The function fits a network with the nodes at the
positions pos into the drawable plot.

Finally, tell Matplotlib to pack the subplots as tight as possible and save and
display the images by calling the auxiliary function dzcnapy_plotlib.plot().

nutrients.py
dzcnapy.plot("nutrients")

The figure on page 28 shows four different drawings of the same network of
foods and nutrients. Remember that most of the layout procedures are
probabilistic. If you run the code yourself, you will likely get a somewhat
different layout.
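The layout phase can be explored on its own, without drawing anything. The sketch below assumes a NetworkX release (2.x or later) whose spring_layout() accepts a seed parameter, and it requires NumPy; the four-edge graph and the seed value 42 are arbitrary.

```python
import networkx as nx

G = nx.Graph([("A", "eggs"), ("A", "liver"), ("Zn", "liver"), ("Zn", "beef")])

# Layout phase only: each function returns {node: array([x, y])}
circular = nx.circular_layout(G)

# A fixed seed (an arbitrary value here) makes a probabilistic layout repeatable
first = nx.spring_layout(G, seed=42)
second = nx.spring_layout(G, seed=42)

same = all((first[node] == second[node]).all() for node in G)
print(set(circular) == set(G.nodes()))  # True: one position per node
print(same)                             # True: same seed, same layout
```

Passing the same dictionary of positions to several drawing calls is also a simple way to keep the nodes in place across different renderings of the same network.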


2. Available from pragprog.com/book/dzcnapy
[Figure: four drawings of the network of foods and nutrients, in panels titled Random, Circular, Force-Directed, and Spectral]



For most complex networks, the spring layout (the default layout for
nx.draw_networkx()) produces the most pleasing output. Note that Matplotlib does
not accurately place node labels (to the extent that I preferred not to show
them in the figure at all). Some of them badly overlap. If you think this is a
problem (and yes, it is!), switch to Gephi; it is described in Chapter 4,
Introducing Gephi, on page 31. But try graphviz first!


Harness Graphviz 

graphviz is an open source graph visualization tool written in C, with bindings
available in C, Tcl/Tk, guile, Java, Perl, PHP, Ruby, and, most importantly,
Python. Among other things, it provides yet another layout engine, which is
typically better than any of the engines mentioned previously.

Using graphviz in your code is trivial. The only two lines affected by the switch
from, say, nx.draw_networkx() to graphviz are highlighted in the following code
fragment. Due to the better overall layout quality, the node labels have better
chances of not overlapping and should not be disabled.

nutrients.py
from networkx.drawing.nx_agraph import graphviz_layout # highlighted
_, plot = plt.subplots()
pos = graphviz_layout(G) # highlighted
nx.draw_networkx(G, pos, **dzcnapy.attrs)
dzcnapy.set_extent(pos, plot)
dzcnapy.plot("nutrients-graphviz")
The following figure shows the output of graphviz. Compare it to the figure on
page 8 or to any graph in the previous figure. The difference is impressive.



As you are getting ready to save your network into a file, here’s some food for 
thought: why do all pink nodes have yellow neighbors and the other way 
around? 

Share and Preserve Networks 

At this point, you must be very proud of the job well done. The network of
foods and nutrients has been extracted, constructed, and visualized. It’s time
to save it into a file, and there are several compelling reasons for doing so:

1. You never analyzed the network, because you don’t know how. When you
read through Chapter 8, Measuring Networks, on page 83 and Chapter 11,
Unearthing the Network Structure, on page 125, you’ll be able to calculate
various network measures and extract network structure. You’ll need the
network when you get there, but you won’t have it unless you save it now.

2. You may want to get a better visualization of the network with
Gephi. The only way for NetworkX to pass the network to Gephi is via a file.

3. Sharing the network with fellow researchers is priceless, but they
want a file with the live network edges and nodes, not a still image.

NetworkX supports many popular file formats suitable for interchanging network 
data with other Software. 

Export and Import Networks

Any NetworkX network can be exported to or imported from files (serialized and
de-serialized) in the formats shown in the following table. All nx.read_*()
functions take the name of an existing file or an open file handle and return a
Graph object. All nx.write_*() functions take a Graph object and the name of a
file or an open file handle. Files with names ending in .gz or .bz2 are
automatically compressed or uncompressed. Some functions require that the
files be opened in the binary mode.


Format                     Attributes  Reader              Writer               Supported by Gephi?
Adjacency list             Not stored  nx.read_adjlist()   nx.write_adjlist()   Yes
Edge list                  Not stored  nx.read_edgelist()  nx.write_edgelist()  Yes
Graph exchange XML format  Stored      nx.read_gexf()      nx.write_gexf()      Yes
Graph modeling language    Stored      nx.read_gml()       nx.write_gml()       W/o attributes
GraphML                    Stored      nx.read_graphml()   nx.write_graphml()   Yes
Pajek NET                  Not stored  nx.read_pajek()     nx.write_pajek()     Yes
Pickle                     Stored      nx.read_gpickle()   nx.write_gpickle()   No
YAML                       Stored      nx.read_yaml()      nx.write_yaml()      No


As an example, let’s export the G network as a GraphML file. In my experience, 
GraphML is the best interchange format between NetworkX and Gephi. 


nx.write_graphml(G, "nutrients.graphml")

Or:

with open("nutrients.graphml", "wb") as ofile:
    nx.write_graphml(G, ofile)
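
To convince yourself that GraphML keeps node attributes, a round trip through a temporary file is a quick check. The following is a minimal sketch, not a part of nutrients.py: the two-node graph and the attribute name kind are made up for illustration, and the .gz suffix relies on the automatic compression described earlier.

```python
import os
import tempfile

import networkx as nx

# A made-up two-node network; "kind" is a hypothetical attribute name
G = nx.Graph()
G.add_node("Eggs", kind="food")
G.add_node("Se", kind="nutrient")
G.add_edge("Eggs", "Se")

# The .gz suffix asks NetworkX to compress the file on the fly
path = os.path.join(tempfile.mkdtemp(), "toy.graphml.gz")
nx.write_graphml(G, path)

# Read the network back; node attributes should survive the round trip
H = nx.read_graphml(path)
attrs = dict(H.nodes(data=True))
print(attrs["Eggs"], H.has_edge("Eggs", "Se"))
```

The same check with an adjacency list or an edge list would come back without the kind attribute, because those formats do not store attributes.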

You have learned the foundations of NetworkX, a powerful Python library for
network analysis and visualization. You know how to construct a simple
graph incrementally, add node attributes, do some simple visualization, save
networks to files, and restore them from files. Of these tasks, it is
visualization that NetworkX does not handle well.

You’ve heard so much about Gephi so far that you should not be surprised
that it is hiding around the corner (of this page). Just for one chapter, let’s
set NetworkX aside and look into this interactive CNA tool. Sometimes, Gephi is
the quickest way to analyze a not-so-large network once or twice.







Visual signaling is and always will be a most valuable means of
transmitting information in peace and war, and it is not to be imagined
that it will ever be supplanted in its particular function by the introduction
of other methods.

Signal Corps United States Army 

CHAPTER 4

Introducing Gephi 

When you explore an unfamiliar complex network for the first time, it often
helps to perform a quick visual check of its structure before engaging in 
expensive code writing. Sometimes, you can semi-automate even the network 
construction itself (we are talking about small networks, with fewer than a 
couple dozen nodes). 

You can perform many one-time construction, analysis, and conversion tasks
with Gephi. Gephi is a free, Java-based, interactive CNA environment that runs
on all mainstream operating systems. In this chapter, you will learn how to
use Gephi and the data from Draw Your First Network with Paper and Pencil,
on page 6 to build and lay out a network of nutrients and food types, add
attributes to its nodes and edges, and save the network as a
presentation-quality image. You will also learn how to interchange network
files between Gephi and NetworkX.

Worth 1,000 Words 

Gephi¹ is not a part of NetworkX, and you must install it separately. Luckily, the
installation process is straightforward.²

Here’s a summary of Gephi’s capabilities:

• Import existing networks in a variety of formats or create a new network. 

• Edit a new or existing network by adding or removing nodes or edges. 

• Change the size and color of node icons, the size and color of label font, 
and the color and thickness of edges based on node and edge attributes. 


1. gephi.org 

2. gephi.org/users/install/ 

• Calculate basic network measures: centralities, clustering coefficients,
path length distributions, connectedness, and modularity.

• Apply various layout engines to the network graph. 

• Execute additional plug-ins. 

• Export modified networks in a variety of formats. 

• Save network visualization as a PNG, PDF, or SVG file. 

If it looks like everything you ever wanted to do as a network analyst is already
on the list, do not get overexcited. Gephi is an excellent network construction
and analysis tool, but it is interactive. The human user is its slowest component
(yes, this means you are slowing things down). Gephi cannot be programmed to
execute batches of tedious analysis tasks, vary parameters automatically, or
integrate with machine learning or predictive analytics software.

Nonetheless, Gephi is a great lightweight “Paintbrush”-style application and
NetworkX companion. Let’s use it to play with the nutrients network acquired
in Chapter 3, Introducing NetworkX, on page 17.

Import and Modify a Simple Network with Gephi 

The following figure shows the standard main window of Gephi without any
loaded graphs. 



The main window contains three tabs: Overview (for interactive network creation
and exploration), Data Laboratory (for text and numeric editing and research),
and Preview (for presentation-quality visualization). We will start in the Overview
tab, then briefly stop at the Data Laboratory, and finish in the Preview.

The default Overview tab has four empty windows. The upper-left window,
Appearance, is in charge of rendering. You will use it to control node and edge
presentation properties. The lower-left window, Layout, is behind the graph
layout. The central window, Graph, shows the sketch of the network. The
window on the right, with the tabs Statistics and Filters, is for network analysis
and filtering (more about filtering on page 14). You can freely move the windows
around, remove them, and add more windows from the Window menu.

To import an existing network file prepared by NetworkX or any other CNA
software (in our case, nutrients.csv), open it with File > Open. (Gephi displays
only the files that it “knows” how to interpret.) Choose the Undirected graph
type. Check if the number of detected edges and nodes makes sense. If the
import is successful, you will see an ugly black-and-white sketch of your
network in the Graph window. Don’t worry: we will make it look awesome by
the end of this chapter.

By default, Gephi treats networks as mathematical graphs and does not display
node labels. Click the fat T button at the bottom of the Graph window, and
the labels will show up, making the ugly sketch even uglier.

The original black color of the nodes is depressing. Before we go any further,
click the artist’s palette icon in the Appearance window, select the Unique
tab, and choose a fun color (say, light blue) for all the nodes. (Don’t forget
to click Apply!)

Suppose that at this point you realize that eggs are a valuable source of
selenium (they are) and you would like to add a connection between the
respective nodes. You remember that eggs are already on the network, but
you’re not sure about selenium, so you add it to the network, just in case. (It
is a mistake, but you will be able to correct it later.) Choose the Node Pencil
tool in the Graph window, adjust the color and size of the new node, if desired,
and click anywhere in the window. The position of the new node doesn’t
matter; the node will be moved around by the layout procedure.

Now, choose the Edge Pencil tool in the same window. Click the “eggs” node
and then the new nameless one. Congratulations, you added a new edge!

Finally, take care of the anonymous node. It deserves a name, so you’re going 
to name it “Se” for “selenium.” Click the Edit tab of the Appearance window, 
then the node of interest. Edit the label name (the default value is “<null 
value>”)—and realize that now you have two selenium nodes! 

Visit Data Laboratory 

Data Laboratory (in addition to the Edit tab) is another place to look at the edges
and nodes under a microscope and inspect and modify their properties. The
operation that you’re looking for, detection of duplicate nodes, is hidden in
the pull-down menu “More actions” at the top of the Data Table window. It is
called “Detect and merge node duplicates.” When you invoke the operation, Gephi
reports that it found one pair of duplicates, and offers to merge them into one
node. This procedure preserves all edges incident to the nodes. In particular, if
there is an edge connecting the duplicates, it becomes a self-loop edge.

Unfortunately, Gephi does not tell you which nodes are duplicates. You can
either trust the tool or manually find the duplicates in the table, select both
nodes, open the pop-up menu, and select “Merge nodes...” in it. Either way,
you can go back to the Overview window, and you’ll see that there is now only
one Se node, and it is properly connected to the eggs. The network now looks
like the following figure. Node positions and fonts may differ, but surely the
new image is a vast improvement over the black-and-white sketch.



Explore the Network 

Network exploration in Gephi goes hand in hand with selecting visual properties.
Let’s paint and resize the graph nodes based on some of their measures.

You will learn about network measures in Chapter 8, Measuring Networks,
on page 83 and Chapter 11, Unearthing the Network Structure, on page 125.
For now, it suffices to know several basic facts about two of them, as detailed
in the following table:

Measure              Meaning
Degree               The number of immediate neighbors (adjacent nodes). The
                     degree is a non-negative integer number. The larger the
                     degree of a food item is, the more nutrients it provides.
                     The larger the degree of a nutrient is, the more food items
                     provide it.
Community structure  Nodes form tightly knit groups called communities. All
                     foods and nutrients within a community serve some common
                     purpose. Each community has a unique integer identifier
                     called modularity class.

Node degree is the simplest possible node measure. There is no need to
calculate it explicitly. To make node size proportional to the degree, click the
icon with concentric circles in the Appearance window, then the Ranking
button. Select Degree from the “—Choose an attribute” pull-down menu.
Select node sizes that correspond to the smallest and largest degrees (10 and
40 are good choices). And don’t forget to click Apply. Can you see which nodes
have the highest degrees?

Playing with size is fun; playing with color is more fun. Let’s paint the nodes
according to their modularity classes, as I’ve done in the figure on page 36.
To partition a network into communities, click the Run button next to the
Modularity command in the Network Overview section of the Statistics window.
Proceed by clicking OK and Close in the next two dialogs. In the end, you will
see a floating-point number next to Modularity. The number is a measure of
the quality of the decomposition from Outline Modularity-Based Communities,
on page 136. Return to the artist’s palette icon, then click the Partition button.
Select Modularity Class from the “—Choose an attribute” pull-down menu.
Don’t forget to click Apply. If you feel artistic, like me, play with the node
colors. Painting the beef group pink and the vegetable group green is a
no-brainer, but can you choose a good color for the vitamins?

If you plan to use Gephi for more sophisticated CNA jobs, the table on page 36
will help you find which sections of this book match the items in the Statistics
window.

Measure                                    NetworkX reference
Average Degree                             Degree Centrality, on page 92
Network Diameter                           Think in Terms of Paths, on page 88
  (also calculates betweenness centrality) Betweenness Centrality, on page 93
  (also calculates closeness centrality)   Closeness and Harmonic Closeness Centrality, on page 93
  (also calculates eccentricity)           Networks as Circles, on page 91
Graph Density                              Start with Global Measures, on page 83
HITS (Hubs and Authorities)                HITS Hubs and Authorities, on page 95
Modularity                                 Outline Modularity-Based Communities, on page 136
PageRank                                   PageRank, on page 94
Connected Components                       Split Networks into Connected Components, on page 126
Avg. Clustering Coefficient                Explore Neighborhoods, on page 84
Eigenvector Centrality                     Eigenvector Centrality, on page 94
Avg. Path Length                           Think in Terms of Paths, on page 88

Sketch the Network

You’re done with rough rendering, but the layout is still awful. Let’s turn
our attention to the lower-left corner of Gephi. Select your favorite layout from
the “Choose a layout” pull-down menu. When a network is large (500 or more
nodes), the Fruchterman-Reingold layout is usually the most efficient. For
smaller networks, the ForceAtlas 2 layout, with some tweaking, works
marvels. To make things easier, here’s a tip: set the scaling to 100.0 (to place
nodes reasonably far apart) and check the Prevent Overlap box. Run the tool
for a while. You’ll notice that after a couple of seconds, the nodes settle at
their new positions, but the graph as a whole may continue drifting, rotating,
or both.

The last step is to adjust the labels, because surely some of them don’t mind
their manners and sit on top of each other. Select Label Adjust from the
“Choose a layout” pull-down menu and run the tool for a couple of seconds.
This layout engine distorts the original Fruchterman-Reingold but makes
sure that neither nodes nor labels overlap. Hopefully, your network layout
resembles the following figure.

The Graph window still shows a sketch, but this is a high-quality sketch
that nicely displays the structure of the network of foods and nutrients. 
There are five compact groups in the network that could be tentatively called 
Veggies, Cereals, Meats, Proteins, and Folates. You can explore each group’s 
internal composition, as well as connections to the other groups. You can 
even show this sketch to your boss or customer. But it would look much 
better when rendered at high resolution and converted to a presentation- 
quality image. (Save the project via File > Save As... into a .gephi file to avoid 
data loss if Gephi crashes!) 

Prepare a Presentation-Quality Image 

The final painting of the network takes place in the Preview tab. The tab has
two windows: Preview Settings (to control the fine-level rendering engine)
and Preview (to see the rendering results). When you switch to this tab, the
Preview window will be empty. Click the Refresh button. Your network will
be drawn using the default settings, with curved edges and no node labels
(see the following figure). Don’t worry: the labels have not been lost; they
have been turned off.

Gephi supports a variety of preset renderers. The default one is usually not
the best one. I recommend using the “Text outline” preset configuration with
some extra tweaking:

• Unclick “Proportional size.”

• Increase the Font size (in this chapter, I used Comfortaa 24pt).

• Increase edge Thickness (I used 2.0).

• Set edge Opacity at 75.0.

• Optionally, unclick the Curved box.

Click the Refresh button again.

If you’re not happy with what you see, you can make more changes. You can 
zoom into the network and zoom out of it, pan the image, control how many 
edges and nodes are rendered, and so on. You can go back and forth between 
the Preview and Overview to adjust layout and node and edge visualization
properties. Eventually, your network will look similar to the following figure. 



Once you like the rendering results, you can export the graph into a graphics 
file. Gephi provides exporters to SVG (editable vector format), PDF (editable 
vector format), and PNG (non-editable raster format). The final version of the 
network of foods and nutrients is in the figure on page 40. 



A4 vs. Letter 



The default page format for the Gephi PDF exporter is European A4 (also known as
DIN A4 in Germany; it measures 210 by 297 millimeters, or 8.27” x 11.7”). A4
paper is slimmer and taller than the letter-size paper (at 8.5” x 11”) used in the
United States, Canada, Chile, Colombia, Venezuela, the Philippines, and most
Central American countries. If you live in one of the “letter” countries, make sure
to change the paper size.


Once again, you are invited to compare this figure with the hand-drawn network
on page 8!


Combine Gephi and NetworkX 

There is no implicit integration between Gephi and NetworkX. However, you can use
the Gephi graph file exporter (File > Export > Graph file...) to save your network
into a file that could be imported by NetworkX. GraphML is the preferred interchange
format because it preserves all calculated measures (such as centralities and
modularity classes) as node attributes. This way you could use Gephi to perform
a quick-and-dirty interactive analysis of a network and save it into a .graphml file
for further processing.
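
Once a Gephi-exported GraphML file is back in NetworkX, the measures computed in Gephi are ordinary node attributes. The sketch below illustrates the idea with a tiny hand-written GraphML document in place of a real Gephi export; the attribute name modularity_class and the node names are assumptions, so check the key names in your own file.

```python
import io

import networkx as nx

# Hand-written stand-in for a Gephi export (hypothetical content)
GRAPHML = b"""<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="d0" for="node" attr.name="modularity_class" attr.type="int"/>
  <graph edgedefault="undirected">
    <node id="Eggs"><data key="d0">0</data></node>
    <node id="Se"><data key="d0">0</data></node>
    <node id="Spinach"><data key="d0">1</data></node>
    <node id="Folate"><data key="d0">1</data></node>
    <edge source="Eggs" target="Se"/>
    <edge source="Spinach" target="Folate"/>
  </graph>
</graphml>"""

# nx.read_graphml() accepts an open binary file handle as well as a file name
H = nx.read_graphml(io.BytesIO(GRAPHML))

# Group the nodes by the community identifier stored as a node attribute
communities = {}
for node, data in H.nodes(data=True):
    communities.setdefault(data["modularity_class"], []).append(node)
print(communities)
```

With a real export, replace the in-memory handle with the name of the .graphml file saved from Gephi.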

You just learned one more way to build and analyze networks by hand, and you
now understand that analyzing one not-so-large, pre-packaged complex network
with Gephi is a pleasure. You also understand that constructing and analyzing
many large networks by hand is hard, error-prone, and most of the time infeasible.
You are now ready for the first full-Python case study.

Looking down, he surveyed the rest of his clothes, which in parts
resembled the child’s definition of a net as a lot of holes tied
together with string...

Morley Roberts, English novelist and short story writer 


CHAPTER 5

Case Study: Constructing a Network of Wikipedia Pages

So far you have learned two ways of constructing a complex network: a hard
one (from a CSV file and further in Gephi, Construct a Simple Network with
NetworkX, on page 17) and a very hard one (by hand on paper, Draw Your
First Network with Paper and Pencil, on page 6). What is hard for small
networks may be impossible for medium-to-large scale networks; it may be
impossible even for small networks if you must repeat the analysis many
times. The case study in this chapter shows you how to construct a large
network in an easy way: by automatically collecting node and edge data from
the Internet.

The other goal of this study (aside from mastering new network construction 
techniques) is quite pragmatic. Wouldn’t you want to know where complex
network analysis fits in the context of other subjects and disciplines? An
answer to this question is near at hand: on Wikipedia.¹

Let’s start with the Wikipedia page about complex networks—the seed page. 
(Unfortunately, there is no page on complex network analysis itself.) The page 
body has external links and links to other Wikipedia pages. Those other pages 
presumably are somewhat related to complex networks, or else why would 
the Wikipedia editors provide them? 

To build a network out of the seed page and other relevant pages, let’s treat 
the pages (and the respective Wikipedia subjects) as the network nodes and 
the links between the pages as the network edges. You will use snowball
sampling (explained on page 7) to discover all the nodes and edges of interest.


1. en.wikipedia.org/wiki/Complex_network

As a result, you will have a network of all pages related to complex networks
and, hopefully, you will draw some conclusions about it.

Get the Data, Build the Network 

The first half of the project script consists of the initialization
prologue and a heavy-duty loop that retrieves the Wikipedia
pages and simultaneously builds the network of nodes and
edges.

Let’s first import all necessary modules. We will need the module wikipedia for
fetching and exploring Wikipedia pages, the operator itemgetter for sorting a
list of tuples, and, naturally, networkx itself.

To target the snowballing process, define the constant SEED, the name of the
starting page. As a side note, by changing the name of the seed page, you can
apply this analysis to any other subject on Wikipedia.

Last but not least, when you start the snowballing, you will eventually (and
quite soon) bump into the pages describing ISBN and ISSN numbers, the
arXiv, PubMed, and the like. Almost all other Wikipedia pages refer to one or
more of those pages. This hyper-connectedness transforms any network into
a collection of almost perfect gigantic stars, making all Wikipedia-based
networks look similar. To avoid the stardom syndrome, treat the known “star”
pages as stop words in information retrieval. In other words, ignore any links
to them. Constructing the black list of stop words, STOPS, is a matter of trial
and error. I put thirteen subjects on it; you may want to add more when you
come across other “stars.” I also excluded pages whose names begin with
“List of”, because they are simply lists of other subjects.

wiki2net.py

from operator import itemgetter
import networkx as nx
import wikipedia

SEED = "Complex network".title()
STOPS = ("International Standard Serial Number",
         "International Standard Book Number",
         "National Diet Library",
         "International Standard Name Identifier",
         "International Standard Book Number (Identifier)",
         "Pubmed Identifier", "Pubmed Central",
         "Digital Object Identifier", "Arxiv",
         "Proc Natl Acad Sci Usa", "Bibcode",
         "Library Of Congress Control Number", "Jstor")
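
The stop list is compared against link names that have passed through str.title(), which is why its entries are spelled “Pubmed” and “Arxiv”. A small sketch of that filter follows; the function keep_link and the sample link names are hypothetical, and the stop list is shortened to three entries.

```python
# A shortened stop list, capitalized the way str.title() produces names
STOPS = ("Digital Object Identifier", "Arxiv", "Pubmed Identifier")


def keep_link(name):
    """Return the normalized link name, or None if the link is ignored."""
    name = name.title()  # "arXiv" -> "Arxiv", "List of ..." -> "List Of ..."
    if name in STOPS or name.startswith("List Of"):
        return None
    return name


# Hypothetical raw link names, as they might appear on a page
raw = ["arXiv", "Scale-free network", "List of network theory topics",
       "Digital object identifier", "Graph theory"]
kept = [k for k in map(keep_link, raw) if k is not None]
print(kept)
```

Normalizing before the comparison means that one stop-list entry covers every capitalization variant of the same page name.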


This section uses 
Wikipedia. 




The next code fragment deals with setting up the snowballing process. A
breadth-first search, or BFS (sometimes known to computer programmers
as a snowballing algorithm), must remember which pages have been already
processed and which have been discovered but not yet processed. The former
are stored in the set done_set; the latter, in the list todo_lst and set todo_set. You
need two data structures for the unprocessed pages because you want to
know whether a page has been already recorded (an unordered lookup) and
which page is the next to be processed (an ordered lookup). The Aside titled
“Ready, Set(), Go” on page 25 explains why the two operations favor different
data structures.

Snowballing an extensive network (and Wikipedia, with 5,452,810 articles in
the English segment alone, can produce a huge network!) takes considerable
time. Suppose you start with one seed node, and let’s say it has N≈100
neighbors. Each of them has N neighbors, too, to the total of ≈N+N×N nodes. The
third round of discovery adds ≈N×N×N more nodes. The time to shave each
next layer of nodes grows exponentially. For this exercise, let’s process only
the seed node itself and its immediate neighbors (layers 0 and 1). Processing
layer 2 is still feasible, but layer 3 requires N×N×N×N≈10^8 page downloads,
close to one year of your machine time. To keep track of the distance from
the currently processed node to the seed, store both the layer to which a node
belongs and the node name together as a tuple on the todo_lst list.

wiki2net.py

todo_lst = [(0, SEED)]  # The SEED is in the layer 0
todo_set = {SEED}  # The SEED itself (set(SEED) would split the name into letters)
done_set = set()  # Nothing is done yet

The output of the exercise is a NetworkX graph. The next fragment will create
an empty directed graph that will later absorb discovered nodes and edges.
We choose a directed graph because the edges that represent HTML links
are naturally directed: a link from page A to page B does not imply a
reciprocal link.

The same fragment primes the algorithm by extracting the first “to-do” item 
(both its layer and page name) from the namesake list. 

wiki2net.py

F = nx.DiGraph()
layer, page = todo_lst[0]

It may take a fraction of a second to execute the first five lines of the script.
It may take the whole next year or longer to finish the next twenty lines
because they contain the main collection/construction loop of the project.

wiki2net.py

while layer < 2:
    # ❶ Take the current page off the to-do list and mark it as processed
    del todo_lst[0]
    done_set.add(page)
    print(layer, page)  # Show progress

    # ❷ Try to download the page; on failure, move on to the next one
    try:
        wiki = wikipedia.page(page)
    except Exception:
        print("Could not load", page)  # Report the page that failed
        layer, page = todo_lst[0]
        continue

    # ❸ Record the links and schedule the newly discovered pages
    for link in wiki.links:
        link = link.title()
        if link not in STOPS and not link.startswith("List Of"):
            if link not in todo_set and link not in done_set:
                todo_lst.append((layer + 1, link))
                todo_set.add(link)
            F.add_edge(page, link)  # highlighted

    # ❹ Pick the next page to process
    layer, page = todo_lst[0]

print("{} nodes, {} edges".format(len(F), nx.number_of_edges(F)))
# 11597 nodes, 21331 edges

The loop is programmed to collect all nodes that are at most two steps away
from the seed node. They are reachable from the nodes in layer 1, and all 
those nodes will have been harvested when the loop terminates. The loop 
body consists of the following four blocks: 

❶ Remove the name of the current page from todo_lst, and add it to
the set of processed pages. If the script encounters this page again, it will
skip over it.

❷ Attempt to download the selected page. If the attempt is unsuccessful
(things happen!), proceed to the next page from the “to-do” list.

❸ Evaluate each link. If the subject is not blacklisted and not a list itself,
the script adds an edge to the graph between the current node and the
linked page. If the script did not process the linked page before and it is
not on the “to-do” list, add it to the list and corresponding set. Note that
the highlighted code line is the only line in the script that is involved in
the network construction!

❹ Take the next page name from the “to-do” list. Hopefully, the list is not
empty. If it is, congratulations: you just downloaded the complete
Wikipedia!
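
The four blocks can be tried out without touching Wikipedia. In the sketch below, the dictionary LINKS is a made-up stand-in for the pages and their links, the stop list is reduced to one invented entry, and the function name snowball is hypothetical; a page missing from LINKS plays the role of a failed download. Unlike the script, the sketch also stops cleanly when the to-do list runs out.

```python
# Made-up stand-in for Wikipedia: page name -> names of the linked pages
LINKS = {
    "Seed": ["A", "B", "List Of Things"],
    "A": ["B", "C"],
    "B": ["Seed", "D"],
    "C": ["E"],
}
STOPS = {"D"}  # hypothetical stop list


def snowball(seed, max_layer=2):
    """Return the directed edges found by processing layers below max_layer."""
    todo_lst = [(0, seed)]  # ordered: which page comes next
    todo_set = {seed}       # unordered: has the page been recorded?
    done_set = set()
    edges = []
    while todo_lst:
        layer, page = todo_lst.pop(0)
        if layer >= max_layer:
            break  # every remaining page is at least this far from the seed
        done_set.add(page)
        for link in LINKS.get(page, []):  # no entry = page could not be loaded
            if link in STOPS or link.startswith("List Of"):
                continue
            if link not in todo_set and link not in done_set:
                todo_lst.append((layer + 1, link))
                todo_set.add(link)
            edges.append((page, link))
    return edges


edges = snowball("Seed")
print(edges)
```

Page C is discovered in layer 2 but never processed, so its link to E does not appear among the edges.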

The network of interest is now in the variable F. But it is “dirty”: inaccurate, 
incomplete, and erroneous. 

Eliminate Duplicates

Many Wikipedia pages exist under two or more names. For example, there 
are pages about Complex Network and Complex Networks. The latter redirects 
to the former, but NetworkX does not know about the redirection. 

Accurately merging all duplicate nodes involves natural language processing
(NLP) tools that are outside of the scope of this book. It may suffice to join 
only those nodes that differ by the presence/absence of the letter s at the end 
or a hyphen in the middle. 

Start by removing self-loops (pages referring to themselves). The loops don’t change
the network properties but affect the correctness of duplicate node elimination.

Now, you need a list of at least some duplicate nodes. You can build it by 
looking at each node in F and checking if a node with the same name, but 
with an s at the end, is also in F. Pass each pair of duplicated node names to 
the function nx.contracted_nodes(F,u,v) that merges node v into node u in the graph 
F. The function reassigns all edges previously incident to v, to u. If you don’t
pass the option self_loops=False, the function converts an edge from v to u (if
any) to a self-loop.

wiki2net.py

# NetworkX 1.x method; in 2.x and later, use nx.selfloop_edges(F)
F.remove_edges_from(F.selfloop_edges())
duplicates = [(node, node + "s") for node in F if node + "s" in F]
for dup in duplicates:
    F = nx.contracted_nodes(F, *dup, self_loops=False)
duplicates = [(x, y) for x, y
              in [(node, node.replace("-", " ")) for node in F]
              if x != y and y in F]
for dup in duplicates:
    F = nx.contracted_nodes(F, *dup, self_loops=False)
# NetworkX 1.x argument order; 2.x and later expect (F, 0, "contraction")
nx.set_node_attributes(F, "contraction", 0)  # highlighted



Thou Shall Not Contract Self-Loops

Due to a bug in the implementation of nx.contracted_nodes() in NetworkX
1.11 and earlier versions, the function fails to merge duplicates if
one of them has a self-loop. Hopefully, this bug will be fixed in the
future. For now, either permanently delete all self-loop edges before
eliminating duplicates, or compose a list of the self-loop edges,
remove them, eliminate the duplicates, and add the self-loop edges
back to the network graph.


As a side effect, nx.contracted_nodes() creates a new node attribute (see Add
Attributes, on page 23) called contraction. The value of the attribute is a
dictionary, but GraphML does not support dictionary attributes. The highlighted
code line sets the contraction property to 0 for all nodes to avoid further
troubles with exporting the graph. You could also delete the attribute for each
node n with del F.node[n]["contraction"] in a loop.
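
To see the merge at work on something smaller than Wikipedia, here is a minimal sketch with a made-up four-page fragment. The page names and edges are invented for illustration, and the attribute cleanup uses the loop variant rather than nx.set_node_attributes(), so that the code does not depend on the argument order of a particular NetworkX version.

```python
import networkx as nx

# A made-up fragment of the page network with a plural duplicate
F = nx.DiGraph()
F.add_edges_from([("Complex Network", "Graph Theory"),
                  ("Complex Networks", "Scale-Free Network"),
                  ("Network Science", "Complex Networks"),
                  ("Complex Networks", "Complex Network")])  # the redirect

# Pair each name with its plural form if both are present
duplicates = [(node, node + "s") for node in F if node + "s" in F]

for u, v in duplicates:
    # Merge v into u; the edge between them is dropped, not made a self-loop
    F = nx.contracted_nodes(F, u, v, self_loops=False)

# Drop the dictionary-valued attribute left behind by the merge
for _, data in F.nodes(data=True):
    data.pop("contraction", None)

print(sorted(F.edges()))
```

The edges that used to start or end at the plural page now belong to the singular one, and the graph is again safe to export to GraphML.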

Truncate the Network

Why did you go through all those Wikipedia troubles? First, to construct a
network of subjects related to complex networks, and here it is. Second, to
find other significant topics related to complex networks. But what is the
measure of significance?

You will discover a variety of network measures in Chapter 8, Measuring 
Networks, on page 83. Some of them are excellent proxies for node importance. 
For now, let’s concentrate on a node indegree —the number of edges directed 
into the node. (In the same spirit, the number of edges directed out of the 
node is called outdegree.) The indegree of a node equals the number of HTML 
links pointing to the respective page. If a page has a lot of links to it, the
topic of the page must be significant. 

The choice of indegree as a yardstick of significance incidentally makes it 
possible to shrink the graph size by almost 75 percent. The extracted graph 
has 11,390 nodes and 20,392 edges—an average of 1.8 edges per node. Most 
of the nodes have only one connection. (Interestingly, there are no isolated 
nodes with no connection in the graph. Even if they exist, you will not find
them because of the way snowballing works.) You can remove all nodes with
only one incident edge to make the network more compact and less hairy
without hurting the final results. Why?

• If a node has one incoming edge, then removing the node affects the
outdegree of some other node, but you do not care about outdegrees.

• If a node has one outgoing edge (and the node is not the seed), you could 
not have found it, at least not with snowballing. 

As you can see, the following code fragment safely removes 75 percent of the 
nodes and 45 percent of the edges, raising the average number of edges per 
node to 3.9. 

wiki2net.py

core = [node for node, deg in F.degree().items() if deg >= 2]
G = nx.subgraph(F, core)
print("{} nodes, {} edges".format(len(G), nx.number_of_edges(G)))
# 2995 nodes, 11817 edges
nx.write_graphml(G, "cna.graphml")
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Start by calling F.degreeO. The method retums a dictionaiy with nodes as keys and 
degree as values. Note that at this point, you don’t need to distinguish indegrees 
and outdegrees (for the reasons explained prevlously). Expand the dictionaiy Into 
a list of key/value tuples and select the nodes whose degree is at least 2—the 
“dense core” of the network. 


Function nx.subgraph(F, core) collects all core nodes from F and all edges connecting
them and builds a new graph G, a subgraph of F. (Naturally, F has a lot of different
subgraphs. Even F itself is a subgraph of F.) G is a truncated version of F. Write it to
a GraphML file so that you don’t have to rebuild it if you need it later again.
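
The pruning step relies on degree() returning a dictionary, which is how the NetworkX 1.x releases behave. In NetworkX 2.0 and later, the same call returns a view of (node, degree) pairs that has no items() method. Here is a minimal sketch of the same step that should run under either behavior. The toy graph and its node names are made up for illustration and stand in for F.

```python
import networkx as nx

# A toy directed graph standing in for F (hypothetical data)
F = nx.DiGraph([("seed", "a"), ("seed", "b"), ("a", "b"), ("b", "leaf")])

# dict() accepts both a dictionary and a view of (node, degree) pairs
degrees = dict(F.degree())
core = [node for node, deg in degrees.items() if deg >= 2]

# .copy() detaches the result from F so it can be modified or saved freely
G = F.subgraph(core).copy()
print("{} nodes, {} edges".format(len(G), G.number_of_edges()))
```

The node "leaf" has only one incident edge, so it is the only node dropped from the toy graph.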


Explore the Network 


The following figure is a Gephi rendering of G. The “Complex Network” node is
barely visible right in the middle of the image, denoted with a darker color. Node
and label font sizes represent the indegrees. The most in-connected, most
significant nodes are in the upper-left corner of the network. What are they?

The last code fragment of the exercise efficiently calculates the answer by
calling the method G.in_degree(). The method (and its sister method G.out_degree())
are very similar to G.degree() except that they report different edge counts.

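wiki2net.py

from operator import itemgetter

# Sort (node, indegree) pairs by indegree, largest first, and keep the top 100
top_indegree = sorted(G.in_degree().items(),
                      reverse=True, key=itemgetter(1))[:100]
# Print one "indegree name" pair per line
print("\n".join(map(lambda t: "{} {}".format(*reversed(t)), top_indegree)))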

Sort the list of item tuples in the order of decreasing indegrees, pick the top
one hundred items, and print their indegrees and names. These are the one 
hundred most significant subjects that, according to Wikipedia, go along 
with complex networks. The first twenty-five of them are listed in the table 
on page 49. 
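
The ranking can be tried on a small graph before running it on the full network. The sketch below uses made-up page names and wraps in_degree() in dict(), so it should work whether the method returns a dictionary (NetworkX 1.x) or a view of pairs (NetworkX 2.0 and later).

```python
from operator import itemgetter
import networkx as nx

# A toy directed graph standing in for G (hypothetical page names)
G = nx.DiGraph([("a", "hub"), ("b", "hub"), ("c", "hub"),
                ("a", "b"), ("hub", "c"), ("b", "c")])

# dict() works whether in_degree() returns a dictionary or a view
indegrees = dict(G.in_degree())
top_indegree = sorted(indegrees.items(),
                      key=itemgetter(1), reverse=True)[:2]
for name, indegree in top_indegree:
    print(indegree, name)
```

Three edges point to "hub" and two point to "c", so these two nodes lead the ranking.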

It appears almost magical that this book covers the majority of these and the 
remaining seventy-five automatically extracted most significant topics. On 
second thought, the outcome of the experiment merely confirms that the 
Wikipedia link structure reflects the structure of the complex networks 
analysis field. I encourage you to use the code from this case study to explore 
other Wikipedia subjects that may be of interest to you. 

This chapter presented a complete complex network construction case study, 
starting from the raw data in the form of HTML pages, all the way to an
analyzable annotated network graph and a simple exploratory exercise. This is
a good foundation for more systematic complex network studies. 

In the Next Part 

You are about to move away from simple networks of nutrients and Wikipedia 
pages to the vast and venerable realm of social networks. In fact, as you
learned in the introduction, social network analysis predates complex network
analysis and serves as one of the cornerstones of CNA. In the next part,
you will learn how to construct, measure, interpret, and understand complex
social networks, as well as the networks that resemble them. 


Indegree  Subject

61        Graph (Discrete Mathematics)
54        Vertex (Graph Theory)
49        Directed Graph
48        Social Network
44        Graph Theory
44        Degree (Graph Theory)
44        Network Theory
39        Edge (Graph Theory)
39        Adjacency Matrix
39        Complete Graph
38        Bipartite Graph
37        Scale Free Network
37        Graph (Abstract Data Type)
36        Social Capital
36        Network Science
36        Small World Network
36        Incidence Matrix
36        Social Network Analysis Software
36        Centrality
36        Loop (Graph Theory)
35        Complex Contagion
35        Complex Network
35        Random Graph
35        Path (Graph Theory)
35        Distance (Graph Theory)
35        Graph Drawing

[Column “Covered in this book?”: per-row marks not recoverable]


Part II

Networks Based on Explicit
Relationships


Social networks (the networks of individuals,
groups, or organizations) are historically the
longest-studied complex networks. They are based
on explicit relationships between the nodes, such
as friendship, kinship, membership, and interaction.




If dogs could reason and criticize us they’d be sure to find just as
much that would be funny to them, if not far more, in the social
relations of men, their masters—far more, indeed.

Fyodor Dostoyevsky, Russian writer


CHAPTER 6

Understanding Social Networks

Individuals, groups, and organizations also form networks. Such networks 
are called social networks. They are historically the longest-studied and
probably the most familiar and intuitive complex networks. Social network 
nodes are explicitly related through friendship, kinship, and membership. 

In this chapter, you will learn the taxonomy of social networks and their
edges. You will understand the role of weak and strong edges in information
dissemination and preservation, and the importance of centrality measures.
In the end, you will have a glance at synthetic networks and learn why one
needs them.

Understand Egocentric and Sociocentric Networks 

Personal social networks are complex networks of persons or social animals.
(No, whales and elephants do not have their own Facebook, but if you are
intrigued, look at Animal Social Networks [KJFC15].) In these networks, nodes
represent people, and edges represent significant social relationships between
people: kinship (remember the family tree on page 3?), friendship,
acquaintanceship, subordination, and the like. Some of the relationships are typically
directed (subordination, some subtypes of kinship), and others are undirected
(friendship, acquaintanceship), giving rise to the namesake graphs. They may
have different weight (Distinguish Strong and Weak Ties, on page 66), leading
to weighted graphs. One can include any or all of these relationships in one
network, ending up with multigraphs and other pseudographs. A social
network is truly a complex one!

And the simplest form of a complex social network is an egocentric network.

Egocentric Networks

An egocentric network (or ego network, for short) is the social network of a 
particular individual. An ego network includes all the individual’s contacts
and all the relationships among them. Using the terminology from Chapter
5, Case Study: Constructing a Network of Wikipedia Pages, on page 41, an
ego network includes the nodes from level 1, and the network of subjects 
related to complex networks is the ego network of the subject “complex 
networks.” 

Egocentric networks are used to understand the structure, function, and 
composition of connections around a single person. Unlike sociocentric
networks, they are bounded and focus on individuals (rather than groups).

The central node of an ego network is referred to as ego (as in egoism and
alter ego); all the other nodes are called alter (as in alternative and alter ego,
again).

To construct a social ego network, start with an ego (say, yourself). Obtain
the list of the ego’s contacts: the alters. If you explore a social networking
website, the list of alters is often called “friends list,” “list of subscribers,” or
“list of followers.” You can download it by using the site API, by scraping and
parsing the site’s HTML code, or, if nothing else works, by copying and pasting
the data by hand.

When a Social Network Is Not a Social Network

When your friends say “social network,” chances are they are using the words wrong.
For example, Facebook is not a social network.

Facebook is a social networking website (SNS): a website that facilitates social
networking by augmenting traditional offline, face-to-face communications with instant
online communications. The difference between a social network and a social
networking website is like the difference between club members and the club building: while
it is easier for the club members to meet in the club building, the building is not
strictly necessary for the club to function.

Your mom is still your mom and belongs to your ego network, whether she is on
Facebook or not.

If you are a sociologist, anthropologist, or another researcher in the field of
social sciences, you may need to deal with real people rather than digital lists.
Your principal inquiry tool is probably a name generator (they are described
in detail in Social Network Analysis [KY08] and other SNA-related books). A
name generator is a list of contacts (alters) prepared on your request by the
ego person, the person who will be the center of the network. (If you’re
working on your ego network, you have to make the list yourself.) Often name
generators have restricted length to facilitate recollection, but ego networks
derived from shortened lists have less elaborate structure.

One way or another, digitally or by hand, you will get a list of some or all
alters. You can arrange them into a star network with the ego at the center
because all of them are connected to the ego. But that’s not enough. Now you
have to repeat the contact collection procedure for each of the alters: either
by calling the APIs/scraping/copying/pasting or by soliciting names through
name generator surveys. With the median number of friends on Facebook
being between 155 and 500, depending on whom you trust, the process of
data collection may become quite daunting, unless properly automated. One
may only wonder how people researched ego networks before Facebook. (Hint:
Before Facebook, there was MySpace.¹)

Ego network construction significantly differs from the snowballing process 
on page 7 in the way you treat newly discovered nodes. An ego network does 
not extend beyond the alters. You’re supposed to discard any detected node 
that is not an alter (which is inefficient, but you can save the unwanted nodes 
for the future—say, for a full social network analysis). 

You Could Have Had Your Facebook Ego Network

The official Facebook application programming interface (API) v1.0 allowed Facebook
apps to download your friends’ friend lists. That’s all you needed to construct your
ego network programmatically. Lada Adamic, a prominent complex network researcher,
wrote a program called GetNet that used the Facebook APIs to build ego networks. The
program worked well until May 1, 2015, when Facebook retired v1.0 due to privacy
concerns. Luckily, I collected my Facebook ego network back in 2013 (see the figure
on page 56). It is four years old but better old than nothing.



Once you harvest the ego and all the alters, remove the ego from the network. 
It is the center of the giant star with the highest connectivity. It dominates 
the network. It is the tree that makes it hard to see the forest. Removing the 
ego does not cause any information loss: if needed, just put it back and
connect it to each existing node.
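
The construction and the removal of the ego can be sketched with the NetworkX function nx.ego_graph(). The example assumes that the larger social network is already available as a graph, which is rarely the case when you collect data alter by alter. The names below are made up.

```python
import networkx as nx

# A toy friendship network (hypothetical names)
S = nx.Graph([("me", "ann"), ("me", "bob"), ("me", "cat"),
              ("ann", "bob"), ("cat", "dan"), ("dan", "eve")])

# radius=1 keeps the ego and its alters; center=False drops the ego itself
alters_only = nx.ego_graph(S, "me", radius=1, center=False)
print(sorted(alters_only.nodes()))
print(alters_only.number_of_edges(), "edge(s) among the alters")

# Putting the ego back: connect it to every remaining node
restored = alters_only.copy()
restored.add_edges_from(("me", alter) for alter in alters_only)
```

Nodes "dan" and "eve" are not alters of "me", so they are discarded, as the text prescribes for any node beyond the alters.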

As an example of a real ego network, let’s have a look at my Facebook ego 
network constructed in 2013. The graph, anonymized for privacy reasons, is 
shown in the figure on page 56. 


1. myspace.com

Here are some facts about it:

• Each node represents one of my friends, relatives, colleagues or
acquaintances.

• Each edge represents what Facebook calls a “friendship,” but in reality
can stand for anything from “she is my sweetheart” to “what’s-his-name.”

• The network does not have “my” node because each shown node is
implicitly connected to me.

• Some nodes look completely isolated. They are not: each node is connected
to me, and there may be other contacts to the nodes that are not my direct
contacts.

• Pink and blue nodes represent female and male contacts.

• Some nodes congregate and form dense subgraphs. Some of these
subgraphs are all-male, some are almost all-female, and some have mixed
gender population. The subgraphs represent different aspects of my social
life: family, close friend circle, current and past jobs, and hobbies. An ego
network is like a spectroscope that separates your alters into a social
spectrum.

• Node size represents betweenness centrality (a measure of node
importance; see Betweenness Centrality, on page 93).

This fact goes with a tricky question: which node most likely represents 
my spouse? 

Most of these facts are true about most of the human ego networks, but
beware: an ego network is only a subgraph of a bigger social network. Anything
you measure in an ego network, such as diameter, centralities, and clustering coefficients
(Chapter 8, Measuring Networks, on page 83), is an approximation of the
same measure for the same node in the bigger graph. Let’s next have a look
at sociocentric networks, where everyone is an ego and alter at the same time.

Sociocentric Networks 

A sociocentric network, or just a social network, is any social network that
is not egocentric. Ideally, a sociocentric network is a combination of the ego
networks of all egos and includes all relevant (whatever it means to you as a
researcher) alters. For example, a social network of all active Facebook users
includes all ≈2.01 billion nodes representing active Facebook members (in
2017) and ≈0.25 trillion edges representing their friendships.² A complete
social network of all living human beings has ≈7.44 billion nodes; the number
of edges must be no more than 0.56 trillion (7.44 billion × 150 / 2) if we
believe in Dunbar’s number [Dun98]: 150, the number of individuals with whom
a person can maintain stable relationships.

A sociocentric network is the prime focus of attention of social network
analysts. It reveals all significant relationships of each actor in the network,
exposes hierarchical groups of actors, and provides a framework for explaining
the structure and evolution of individual edges and node groups.

A non-trivial social network, regardless of its size, is a complex network. What
makes it distinctive is not the size but the interpretation: the social theories
that stand behind the degree distributions, centralities, local network topology,
community structure, and network evolution. The table on page 58 lists some
examples of possible social interpretation of complex network properties.
Some of them will be covered in this book. In this table, I call nodes “actors”
to emphasize their human nature.


2. bigthink.com/praxis/do-you-have-too-many-facebook-friends

Network property                    Examples of social interpretation

Local topology                      Structural equivalence: if two actors have similar
                                    connections to other actors, they are similar or
                                    equivalent.
                                    Triadic closure: two friends of an actor eventually
                                    become friends.
                                    Balance theory: a friend of a friend is a friend, a
                                    friend of a foe is a foe, and so on.

Degree and eigenvector centrality   Social capital: an actor produces common good for
                                    the friends.
                                    Influence: an actor causes a change in behavior in
                                    the friends.

Closeness centrality                Influence: see above.
                                    Information dissemination/diffusion: how good are
                                    actors in broadcasting or sharing information?

Betweenness centrality              Information dissemination: see above.
                                    Brokerage: how good are actors in serving as
                                    “go-betweens”?

Community structure                 Homophily (cognitive balance): “birds of a feather
                                    flock together.”
                                    Knowledge preservation: actors in tightly knit
                                    communities preserve knowledge.
                                    Complex contagion: a gang of interconnected infected
                                    actors is a source of contagion.

Degree distribution                 Small world (six degrees of separation): any two
                                    actors on average are connected by six “handshakes.”
                                    Friendship paradox: “my friends have more friends
                                    than I do.”

Network dynamics                    Preferential attachment (Pareto principle): “the
                                    [actors] rich [in friends] get richer.”


The table is not complete by any means, but it gives you a sample of social
research questions and SNA/CNA machinery typically associated with them.
If interested, see Social Network Analysis: A Handbook [Sco00] and Exploratory
Social Network Analysis with Pajek (Structural Analysis in the Social Sciences)
[NMB11] for a comprehensive coverage of social network analysis with the
emphasis on the social aspects.

Acquisition of Social Networks

Practically, you almost never get a complete social network of interest for 
several reasons. 

1. Real social networks, especially those implemented through social
networking websites (see the Aside titled “When a Social Network Is Not a
Social Network” on page 54), are often huge. Obtaining them may take
more time than you can afford.

2. Real social networks are dynamic. Nodes and edges are added and removed
as you construct the network. By the time you fetch the last node and edge,
the rest of the graph may already be out of sync with the real network.

However, you can still get a decent approximation of a social network using
either snowball sampling or random node sampling.

The principal difference between the two sampling approaches is the source of
the nodes to process. A snowballing algorithm maintains a list of nodes that
have been discovered but not visited yet (in other words, the nodes “at the
other end” of the edges incident to the already visited nodes). The algorithm
adds newly discovered nodes to the list and removes the discovered nodes
after visiting them.
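
The bookkeeping of a snowballing algorithm can be sketched in a few lines. This is an illustration under stated assumptions, not the code used elsewhere in the book: the function name snowball, the callback get_neighbors, and the max_nodes limit are all hypothetical. In practice, the callback would wrap an API call or a page scraper.

```python
from collections import deque
import networkx as nx

def snowball(get_neighbors, seed, max_nodes):
    """Collect a sample graph by snowballing from the seed node.

    get_neighbors is a hypothetical callback that returns the contacts
    of a node; max_nodes limits the number of visited nodes.
    """
    sample = nx.Graph()
    sample.add_node(seed)
    to_visit = deque([seed])   # discovered but not visited yet
    visited = set()
    while to_visit and len(visited) < max_nodes:
        node = to_visit.popleft()          # remove the node from the list...
        if node in visited:
            continue
        visited.add(node)                  # ...and visit it
        for neighbor in get_neighbors(node):
            sample.add_edge(node, neighbor)
            if neighbor not in visited:
                to_visit.append(neighbor)  # newly discovered node
    return sample

# Usage with a toy "real" network standing in for the data source
real = nx.path_graph(10)
sample = snowball(lambda n: list(real.neighbors(n)), seed=0, max_nodes=3)
print(sample.number_of_nodes(), sample.number_of_edges())
```

The sample contains the visited nodes plus the nodes discovered at the other end of their edges, so it can be slightly larger than max_nodes.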

A random sampling algorithm needs an exogenous list of the seed nodes, like: 

1. Actors interested in a particular event or subject (say, the 122,079,386 
Facebook users who like the page of Cristiano Ronaldo, a Portuguese- 
born soccer megastar) 

2. SNS users who are currently online (as once implemented on
Odnoklassniki,³ the second-largest Russian-language SNS)

3. Persons with randomly chosen numerical SNS user IDs (say,
112424342928081542774 on Google+⁴)

4. Persons randomly chosen from a telephone book 

Randomness is the key to successful sampling. If the choice of the seed nodes 
is not random, then a sampling algorithm may go astray and collect an 
unrepresentative part of the network. Frankly speaking, examples 1 and 2 
from the previous list are not entirely random. The first is biased toward the 
actors with particular interests; the second favors the population in a given 
time zone and with a particular online activity pattern.


3. ok.ru 

4. plus.google.com

Once you have a list of seed nodes, you can obtain their ego network graphs,
combine them into one large graph, and enjoy a sampled sociocentered
network. As an example of what you may end up with, the following figure shows
165,795 nodes and 434,118 edges of MoiKrug⁵ (a Russian-language
counterpart of LinkedIn⁶) that I collected in 2009. Different colors in the figure denote
different tight network neighborhoods.



And just to remind you: what you see in the figure is 0.012 percent of the current
Facebook population. Some social networks are complex by all measures.
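
Random node sampling followed by merging the seeds' ego networks can be sketched as follows. The example makes two assumptions for illustration: a synthetic graph plays the role of the full network, and the list of all node IDs is known in advance, which a real social networking website would not hand you.

```python
import random
import networkx as nx

# A synthetic stand-in for the full network (an assumption for illustration)
full = nx.watts_strogatz_graph(200, 4, 0.1, seed=1)

rng = random.Random(42)
seeds = rng.sample(sorted(full.nodes()), 10)   # exogenous random seed list

# One ego network per seed, merged into a single sampled graph
ego_nets = [nx.ego_graph(full, s, radius=1) for s in seeds]
sampled = nx.compose_all(ego_nets)
print(sampled.number_of_nodes(), "of", full.number_of_nodes(), "nodes sampled")
```

Every edge of the sampled graph also exists in the full graph, but not the other way around: the sample is an approximation.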

Signed Networks 

Some social (and not only social) networks belong to a class of signed (as opposed
to unsigned) networks. There is not much special about them, except that they
are weighted, and the weights can take negative values. This feature allows
using the same type of ties to represent both positive and negative aspects of
relationships. For example, a negatively weighted friendship tie between Alice
and Chuck may signify that the folks are foes rather than friends.


5. moikrug.ru 

6. www.linkedin.com

Signed networks are dangerous because their visual inspection does not reveal
the true meaning of signed ties. Any network analysis algorithm that
disregards weights would be fooled into believing that a tie is an indicator of
proximity, while in the case of Alice and Chuck, it is just the opposite.

Nonetheless, some social theories (including the balance theory mentioned 
in the table on page 58) make heavy use of signedness. As a professional, you 
should be ready to handle it. 
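
A small signed network shows both the danger and one possible use of signed ties. The weights +1 and -1 and the attribute name weight are choices made for this sketch. The balance check follows the common convention that a triad is balanced when the product of its edge signs is positive.

```python
import networkx as nx

G = nx.Graph()
G.add_edge("Alice", "Bob", weight=1)      # friends
G.add_edge("Bob", "Chuck", weight=1)      # friends
G.add_edge("Alice", "Chuck", weight=-1)   # foes

# Ignoring the weights, Alice appears to have two ties of the same kind
print(G.degree("Alice"))

# Separating ties by sign reveals the difference
friends = [(u, v) for u, v, w in G.edges(data="weight") if w > 0]
foes = [(u, v) for u, v, w in G.edges(data="weight") if w < 0]
print(len(friends), "positive ties,", len(foes), "negative tie(s)")

# A triad is balanced when the product of its edge signs is positive
product = 1
for u, v, w in G.edges(data="weight"):
    product *= w
print("balanced" if product > 0 else "unbalanced")
```

Here a friend of Alice's friend is her foe, so the triad is unbalanced.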

Prepared Social Networks 

If you’re not in the mood to crawl a social network yourself or need a typical 
(but not any particular) network for your experiment, you have two more 
options. You can either generate a synthetic network with preset properties 
(Appreciate Synthetic Networks, on page 63) or download an empiric network 
from the Stanford Large Network Dataset Collection⁷ compiled by Jure Leskovec
and Andrej Krevl in 2014. 

The collection provides free access to ninety snapshots of various complex 
networks grouped into seventeen categories. At least thirty-three of them 
describe social and communication networks obtained from Facebook, 
Google+, Twitter, LiveJournal, Slashdot,⁸ and similar sites. We will use one
of the networks from the collection (the Enron email communication network)
in the next section.

Another source of publicly available empiric networks is the Koblenz Network
Collection (KONECT) by Jerome Kunegis, featuring 261 datasets.⁹

Recognize Communication Networks 

For most of the world, a communication network is an electrical (digital or 
analog) circuitry that connects terminal communication equipment, such as
wired and cellular phones, computers, modems, TV sets and cable boxes, 
and the like. Not so in social network analysis. 

A social communication network is a social network where the edges represent
a communication relation: “channels through which messages may be
transmitted” [KY08]. In other words, two actors are adjacent if they directly
communicate or have a propensity for direct communication. The
communication medium is not of our concern: it could be face-to-face, verbal over a
phone, email, Internet chat, prison tap code, paper “snail mail,” or even
carrier pigeons. As long as an archetypal Alice can reliably send a message to
an archetypal Bob, anything goes. (If you haven’t met Alice and Bob yet, meet
them in The Tale of Alice, Bob, and Chuck, on page 62.)


7. snap.stanford.edu/data/

8. slashdot.org

9. konect.uni-koblenz.de

Communication networks still connect people or social animals, but the
connectedness is derived not from a relatively stable relationship (friendship,
kinship, or membership), but from short, often instantaneous interactions. 
The interactions may be one-way or two-way; frequent, occasional, or even 
isolated. It is up to you to decide what communication pattern constitutes 
an edge. (I will help you in Distinguish Strong and Weak Ties, on page 66.) 

The most friendly communication medium is corporate emails. In a case of 
major conundrums, corporate email archives may become public. The largest 
public corpus is the Enron email communication network published by the 
Federal Energy Regulatory Commission during the investigation of the
company’s criminal activity in 2008.¹⁰ The network has 36,692 nodes
corresponding to 150 Enron employees and other addressees, and 183,831 edges: direct
node-to-node email communications. The edges (and the network) are
directed. An edge connects the sender and the recipient. The researchers who
constructed the communication network did not incorporate the intensity of
interactions in the published dataset. We cannot tell which edges represent
strong and weak message exchanges.
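
A dataset of this kind is typically loaded from an edge list. The sketch below assumes a plain-text layout with one sender/recipient pair of numeric node IDs per line and comment lines starting with “#”. That layout is an assumption to verify against the downloaded file, and the sample lines are made up rather than taken from the Enron data.

```python
import networkx as nx

# Made-up sample lines in the assumed layout (not actual Enron data)
lines = [
    "# sender  recipient",
    "0\t1",
    "1\t0",
    "1\t2",
    "3\t1",
]

G = nx.parse_edgelist(lines, comments="#", create_using=nx.DiGraph(),
                      nodetype=int)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# Reciprocated pairs hint at two-way communication
mutual = [(u, v) for u, v in G.edges() if u < v and G.has_edge(v, u)]
print(mutual)
```

For a file on disk, nx.read_edgelist() accepts the same comments, create_using, and nodetype parameters along with a path.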

There are some unique issues associated with communication networks. First,
some communication media allow information broadcast. For example, a
speaker at a convention addresses all attendees at once. An email message can
be sent to many recipients. A public Internet forum post is usually read by
more than one reader. Assuming that each attendee/reader/recipient is not a
random passerby, you can model group communications as a star by connecting
the speaker/poster/sender to each addressee with a separate edge.
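
The star model of a broadcast translates directly into code. The message records below are hypothetical. Each one pairs a sender with a list of recipients.

```python
import networkx as nx

# Hypothetical messages: (sender, list of recipients)
messages = [
    ("alice", ["bob", "chuck", "dana"]),   # one email, three recipients
    ("bob", ["alice"]),
]

G = nx.DiGraph()
for sender, recipients in messages:
    # One separate edge per addressee: the sender is the hub of a star
    G.add_edges_from((sender, recipient) for recipient in recipients)

print(sorted(G.edges()))
```

A single message to three recipients contributes three directed edges, all starting at the sender.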

The Tale of Alice, Bob, and Chuck

Alice and Bob are fictional characters frequently used in cryptology and communication
theory to represent communicating parties. They were invented in 1978 and lived
happily ever after. Alice usually is the sender: she initiates the conversation by
sending a message to Bob, who is the recipient. Sometimes, the communication is
marred by Chuck the Bad Guy whose goal is to intercept, interrupt, modify, or
fabricate the message.


10. snap.stanford.edu/data/email-Enron.html

The second issue is that communication networks are often less rigorously
documented than other types of social networks, or it takes more effort to access
documentation. Verbal communications are almost never recorded. Telephone
billing records may be available if requested by law enforcement officers, but
not by social network analysts. Internet chat sessions are compulsorily
recorded in some countries (such as Russia) but, again, not by “us” but by
“them.” And the free-flying carrier pigeons are probably the worst to track.

Appreciate Synthetic Networks 

Synthetic networks are a cheap alternative to real-world, empiric networks.
Unlike empiric networks that have to be scrupulously collected, either
automatically or by hand, synthetic networks are generated by computer software
(in our case, NetworkX). With proper adjustment through the right choice of
the parameters of synthetic graph generators, you can produce networks of
almost any type, resembling empiric networks to the point of confusion.

The following figure shows six generated “classic” networks. You saw almost 
all of them (except for the complete graph) in Know Thy Networks, on page 
2. However, that figure was hand-programmed, but the following one is 
produced by the NetworkX graph generator functions. You will learn how to use 
them in Generate Synthetic Networks, on page 78.

It is worth remembering that these six networks are not complex because
they have a predictable, regular, and easily describable structure. But the
networks in the following figure are random, though defined by four different
random models. (The models are properly described in Network Analysis:
Methodological Foundations [BE05].)

[Figure: four random networks. Surviving panel titles: Erdős-Rényi (p=0.05), Holme-Kim (k=4, p=0.5)]



An undirected Erdős-Rényi graph, also known as a binomial graph, contains
N nodes. It could have up to N×(N-1)/2 edges, but each edge is instantiated with
the probability of p. As a result, the expected number of edges is p×N×(N-1)/2.
If p==0, then the network falls apart into isolated nodes. If p==1, the network
becomes a complete graph. Note that in general, a node is not connected to
its geometric neighbors.
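
The expected edge count and the two extreme cases are easy to check with the NetworkX generator nx.erdos_renyi_graph(). The values of N, p, and the random seed are arbitrary choices for this sketch.

```python
import networkx as nx

N, p = 100, 0.05
expected = p * N * (N - 1) / 2
print("expected number of edges:", expected)

G = nx.erdos_renyi_graph(N, p, seed=1)
print("edges in one random instance:", G.number_of_edges())

# The two extreme cases: isolated nodes and a complete graph
empty = nx.erdos_renyi_graph(N, 0, seed=1)
full = nx.erdos_renyi_graph(N, 1, seed=1)
print(empty.number_of_edges(), full.number_of_edges())
```

A single random instance rarely has exactly the expected number of edges. The formula describes the average over many instances.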

If you have no definite idea about what kind of network you want, use the
Erdős-Rényi model. However, a Watts-Strogatz graph is a much more realistic
approximation of a real-world social network. The model arranges N nodes in
a ring, connects each node to k ring neighbors, and then “rewires” any edge
(reconnects one of its ends to a randomly chosen node) with the probability
of p. The rewired edges typically go across the ring. They create an illusion of a
“small world,” where geometrically remote nodes may be connected with a short
path. The model explains the phenomenon of “six degrees of separation,” which
claims that, on average, any two people on Earth are only six handshakes apart
from each other [Wat03].

Unfortunately, no matter how you twist them, nodes in Watts-Strogatz
networks do not form tight communities, and this makes “small-world” networks
somewhat unrealistic, too. The Barabási-Albert preferential attachment model
[Bar03] adds another level of realism to synthetic social networks. It uses the
principle of preferential attachment mentioned briefly on page 5 and more
in detail in What Makes Components Giant?, on page 129. When a new node
is about to join an existing network, it is likely to make k connections to the
nodes with the highest degree. The model stimulates the emergence of hubs:
“celebrity” nodes with disproportionately many connections.

The Holme-Kim model goes one step further. After adding k edges, it also adds
triads (introduced on page 6) with the probability of p, making the synthetic
network even more clustered and lifelike.
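
The three models can be compared side by side with NetworkX generators. In NetworkX, the Holme-Kim algorithm is available as nx.powerlaw_cluster_graph(), and the number of edges added per new node, called k in the text, is the parameter named m. The network size, parameter values, and random seeds below are arbitrary choices for this sketch.

```python
import networkx as nx

N = 500
models = {
    # ring of N nodes, k=4 neighbors each, rewiring probability p=0.1
    "Watts-Strogatz": nx.watts_strogatz_graph(N, 4, 0.1, seed=1),
    # each new node attaches to 4 existing nodes, favoring high degrees
    "Barabasi-Albert": nx.barabasi_albert_graph(N, 4, seed=1),
    # same, plus a triad formation step with probability p=0.5
    "Holme-Kim": nx.powerlaw_cluster_graph(N, 4, 0.5, seed=1),
}

for name, G in models.items():
    max_degree = max(dict(G.degree()).values())
    print("{}: max degree {}, average clustering {:.3f}".format(
        name, max_degree, nx.average_clustering(G)))
```

The maximum degree serves as a rough indicator of hubs, and the average clustering coefficient as a rough indicator of tightly knit neighborhoods. Both measures are discussed in Chapter 8.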

Over the long history of social network analysis, several empiric networks
were so frequently used in case studies that they became the “gold standard”
of a small social network, very much like the “Hello, world!” programs in
computer programming. For any practical purpose, they are almost as good
as synthetic networks, except that they are really tiny. (But you cannot blame
the researchers who constructed them! It was the time before NetworkX and
even before personal computers.) The following figure shows three famous
social networks: Zachary’s Karate Club, Davis Southern women and the events
they attend, and marriages in Florentine families.
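
All three networks ship with NetworkX, so they can be inspected without any downloads. This sketch only loads them and reports their sizes.

```python
import networkx as nx

karate = nx.karate_club_graph()
women = nx.davis_southern_women_graph()
families = nx.florentine_families_graph()

for name, G in [("Karate Club", karate),
                ("Southern women", women),
                ("Florentine families", families)]:
    print("{}: {} nodes, {} edges".format(
        name, G.number_of_nodes(), G.number_of_edges()))
```

The Southern women network is bipartite: its nodes are both the women and the events they attended.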
Whether you assembled a social network by hand, by running a program, or
by downloading a publicly available network graph from SNAP or another 
depository, remember: no two friendships, no two kinships, and no two human
interactions are the same. That is to say, no two links in a social network are 
the same, but some are weak, and some are strong. Why does it matter? 

Distinguish Strong and Weak Ties

Hardcore social network researchers call social network graph edges “ties.”
In this section, I will sometimes refer to edges as ties, mostly out of respect
to Mark Granovetter who introduced [Gra73] and elaborated [Gra83] the concept
of weak and strong ties and later showed that weak ties are “strong.” In a
sense, he was the first social researcher to consider weighted social networks.

Granovetter did not propose a mechanism for quantifying the strength of a
tie, but offered four criteria to consider:

• The amount of time spent in the tie 

• Emotional intensity

• Intimacy (mutual trust) 

• Reciprocity 

You may be able to evaluate the first and the last criteria, but the middle two 
require looking into the contents of the messages, which cannot be done if 
only message signatures are available. Nonetheless, combined with sentiment 
analysis and other natural language processing techniques (see, for example, 
Natural Language Processing with Python [BKL09]), it may be possible to 
develop a formal mechanism for assessing tie strength.

Granovetter argues that weak and strong ties have different vocations in social
networks. Strong ties (such as those between spouses and close friends) tend
to draw nodes together into tight, densely interconnected clusters: cliques
(Extract Cliques, on page 131). Cliques often act as “knowledge reservoirs”:
anything said by any actor in a clique is overheard and presumably
remembered by all other clique members. If any clique member forgets anything,
the clique as a whole can easily reconstruct the missing knowledge.

Cliques are great at knowledge preservation but not good at knowledge
generation. Since different cliques preserve different types of knowledge,
connections between the “reservoirs” enable sharing. Such links are called bridges.
You will learn how to detect bridges in Betweenness Centrality, on page 93.
Not surprisingly, weak ties often serve as bridges, and that’s why they are
“strong.” And, regardless of their function in a social network, from complex
network analysis’ perspective, both weak and strong ties are just weighted
edges with appropriately selected weights.
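
A toy network of two cliques joined by a single weak tie shows why such a tie matters. The weights 1.0 and 0.1 are arbitrary labels for “strong” and “weak” chosen for this sketch. The betweenness computation ignores them and simply counts the shortest paths passing through each edge.

```python
import networkx as nx

# Two tight cliques of four actors each (strong ties)
G = nx.Graph()
for clique in (["a1", "a2", "a3", "a4"], ["b1", "b2", "b3", "b4"]):
    for i, u in enumerate(clique):
        for v in clique[i + 1:]:
            G.add_edge(u, v, weight=1.0)
# The cliques are joined by a single weak tie
G.add_edge("a1", "b1", weight=0.1)

# The edge carrying the most shortest paths is the bridge
ebc = nx.edge_betweenness_centrality(G)
bridge = max(ebc, key=ebc.get)
print(sorted(bridge), G[bridge[0]][bridge[1]]["weight"])
```

Every shortest path between the two cliques passes through the weak tie, which makes it the most “between” edge in the network.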

Social networks are a special breed of complex networks, limited to individuals
(humans and animals) and organizations. They have been extensively studied
in social and behavioral sciences. Social network analysis served as a
precursor to complex network analysis. Complex networks often have thousands
and millions of nodes, and depend on edges having different weights. In the
next chapter, you will unlock NetworkX tools for the efficient construction of
massive networks, including synthetic and weighted networks.
Here you see we have a very advanced form of drawing, and a
form I should not advise you to employ in your early efforts to do
professional work.

Ernest Knaufft, American editor and director of the
Chautauqua Society of Fine Arts

CHAPTER 7


Mastering Advanced Network Construction


Complex networks are rarely constructed one node and one edge at a time.
Instead, they are generated from matrix data, edge lists, node dictionaries,
probability distributions, and other native Python and third-party data
structures. As a complex network analyst, you need to be familiar with the
NetworkX interfaces to the real world.


In this chapter, you will learn how to convert Python and third-party data
structures (namely, Pandas DataFrames and NumPy matrices) into NetworkX
graphs and back, and how to generate synthetic networks.


Create Networks from Adjacency and Incidence Matrices 

Mathematical graphs as collections of nodes and edges are
not the only way to represent complex networks. Researchers and
practitioners often use tabular (matrix) data to describe
networks. The two most popular matrix-based descriptions
are adjacency and incidence matrices. (You may want to remind yourself of
the definitions of adjacency and incidence on the bulleted list on page 17.)


Adjacency Matrix, the Python Way 

An adjacency matrix A is a square NxN matrix, where N is the size of the graph
to be defined. The row and column indexes indicate the source and target
nodes, respectively. Depending on the network type, the acceptable range,
properties, and interpretation of the matrix elements differ. If a network
belongs to more than one type (say, weighted and directed), consider all
relevant properties and interpretations (see table on page 70).
Network type      Adjacency matrix                                Interpretation

Simple            Only 0s or 1s, and no 1s on the main diagonal   Absence/presence of an edge
Weighted          At least one floating-point number              Numbers are edge weights.
With self-loops   At least one non-zero on the main diagonal      Same as above
Signed            At least one negative number                    Same as above
Undirected        Symmetric                                       Same as above
Multigraph        Not possible                                    Cannot be represented as an adjacency matrix
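The rows of the table can be read as a checklist. The following is a minimal pure-Python sketch of that checklist, assuming the matrix is a square list of lists; the function name describe_adjacency and the label strings are illustrative and are not part of NetworkX.

```python
def describe_adjacency(A):
    """Return the network-type labels suggested by a square list of lists A."""
    n = len(A)
    entries = [A[i][j] for i in range(n) for j in range(n)]
    labels = set()
    # Any non-zero element on the main diagonal is a self-loop
    if any(A[i][i] != 0 for i in range(n)):
        labels.add("with self-loops")
    # Any negative element makes the network signed
    if any(x < 0 for x in entries):
        labels.add("signed")
    # Any magnitude other than 0 or 1 is treated as an edge weight
    if any(abs(x) not in (0, 1) for x in entries):
        labels.add("weighted")
    # A symmetric matrix describes an undirected network
    if all(A[i][j] == A[j][i] for i in range(n) for j in range(i)):
        labels.add("undirected")
    else:
        labels.add("directed")
    # "Simple" here means: none of the three special properties above
    if not labels & {"with self-loops", "signed", "weighted"}:
        labels.add("simple")
    return labels


ring = [[0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0]]
print(describe_adjacency(ring))
```

A network can carry several labels at once, which is why the function returns a set rather than a single value.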


As an example, here’s an adjacency matrix for the linear timeline of Abraham 
Lincoln from the figure on page 4 (left) and another very similar network 
(right): 


01000 01000 
00100 00100 
00010 00010 
00001 00001 
00000 10000 

The networks have five nodes each (the matrices are 5x5). The left network
has four edges (the matrix has four 1s), and the right network has an extra
edge. The networks are simple (the matrices have only 0s and 1s, and no 1s
are on the main diagonal), unweighted, and unsigned. The networks are
directed (the matrices are not symmetric). The additional 1 in the lower-left
corner of the matrix on the right converts the linear network into a ring by
connecting the last event (death) to the first event (birth). If Abe Lincoln had
believed in reincarnations, he would have chosen the right matrix.

As a side note, the sum of all 1s in any column or row of an adjacency matrix
equals the indegree or outdegree, respectively, of the corresponding node.
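The side note is easy to check by hand. Here is a small pure-Python sketch, assuming the two matrices shown above are stored as lists of lists; the names line, ring, out_degrees, and in_degrees are made up for this illustration.

```python
# The linear timeline (left matrix) and the ring (right matrix)
line = [[0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0]]
ring = [row[:] for row in line]
ring[4][0] = 1  # the extra death-to-birth edge


def out_degrees(A):
    """Row sums: the number of edges leaving each node."""
    return [sum(row) for row in A]


def in_degrees(A):
    """Column sums: the number of edges entering each node."""
    return [sum(column) for column in zip(*A)]


print(out_degrees(line), in_degrees(line))
print(out_degrees(ring), in_degrees(ring))
```

In the linear network, the last node has no outgoing edges and the first node has no incoming edges; in the ring, every node has exactly one of each.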

The most common way of representing matrices in pure Python is in the form 
of a list of lists. The matrix on the right above is a list of five lists, one list per row.
Suppose it is given to us (say, produced by another function elsewhere in our 
program): 


[[0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 0, 0, 1, 0],
 [0, 0, 0, 0, 1],
 [0.1, 0, 0, 0, 0]]
Note that since the reincarnation is not inevitable at all, not even in the case
of Honest Abe, we set the weight of the death-to-birth edge to 0.1.

How can we convert this matrix to a graph? There are at least three ways:
one uses pure Python and NetworkX, and the other two rely on NumPy (When
Python Goes Numerical, on page 71) and Pandas (Another Kind of Pandas, on
page 73).

When Python Goes Numerical

Surely, Python, just like all other computer software, internally works with numbers,
only numbers, and nothing but numbers. However, when there are too many numbers
to work with, Python performance significantly degrades. NumPy (“Numerical Python”)
is a package for scientific computing that accelerates fundamental numerical
operations. It provides support for multidimensional objects (such as vectors and matrices),
vectorized arithmetic and algebraic operations, and other goodies. NumPy is a part of
the SciPy (“Scientific Python”) ecosystem that also includes Pandas for data science,
Matplotlib for plotting, SymPy for symbolic computations, and IPython for interactive
development.

If performance is not an issue (if your network has fewer than a couple of
thousand nodes), the pure Python solution may make you feel more
comfortable, especially if you have never used NumPy. Remember that any non-zero
element in the adjacency matrix represents an edge from the “row node” to
the “column node.” Create an empty directed graph, enumerate each matrix
element twice (by rows and then by columns), and extract non-zero elements.
Their indexes represent network edges, which you can add to the graph by
calling G.add_edges_from().

import networkx as nx
from itertools import chain # For flattening the list of edges

G = nx.DiGraph()
# For each row i, collect a pair (i, j) for every non-zero element A[i][j]
edges = chain.from_iterable([(i, j)
                             for j, column in enumerate(row)
                             if A[i][j]] for i, row in enumerate(A))
G.add_edges_from(edges)
print(G.edges(data=True))

< [(0, 1, {}), (1, 2, {}), (2, 3, {}), (3, 4, {}), (4, 0, {})]

By default, NetworkX assumes that all edges have the weight of 1, and does not
display weights as edge attributes. If the matrix represents signed or unsigned
weights (rather than absence/presence), you can modify the code to
incorporate the “weight” attribute:
from itertools import chain # For flattening the list of edges

G = nx.DiGraph()
# Same as before, but each edge carries the matrix element as its weight
edges = chain.from_iterable([(i, j, {"weight": A[i][j]})
                             for j, column in enumerate(row)
                             if A[i][j]] for i, row in enumerate(A))
G.add_edges_from(edges)
print(G.edges(data=True))

< [(0, 1, {'weight': 1}), (1, 2, {'weight': 1}), (2, 3, {'weight': 1}),
   (3, 4, {'weight': 1}), (4, 0, {'weight': 0.1})]
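The same conversion can be run in reverse without any library. The sketch below assumes the nodes are numbered 0 to n-1 and the edges come as 3-tuples in the format printed above; edges_to_matrix is a hypothetical helper written for this illustration, not a NetworkX function.

```python
def edges_to_matrix(edges, n):
    """Rebuild an n-by-n adjacency matrix (a list of lists) from weighted edges."""
    A = [[0] * n for _ in range(n)]
    for i, j, attrs in edges:
        # An edge without an explicit weight counts as weight 1
        A[i][j] = attrs.get("weight", 1)
    return A


edges = [(0, 1, {"weight": 1}), (1, 2, {"weight": 1}), (2, 3, {"weight": 1}),
         (3, 4, {"weight": 1}), (4, 0, {"weight": 0.1})]
A_back = edges_to_matrix(edges, 5)
for row in A_back:
    print(row)
```

The result is the list of lists this section started with, which confirms that the matrix and the weighted edge list carry the same information.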

Adjacency Matrix, the NumPy Way 

The NumPy way is somewhat more concise, but you must convert the list of
lists to a 2D matrix and give NetworkX a hint about the network type.

import numpy as np 
A_mtx = np.matrix(A)

G = nx.from_numpy_matrix(A_mtx, create_using=nx.DiGraph())
print(G.edges(data=True))

< [(0, 1, {'weight': 1}), (1, 2, {'weight': 1}), (2, 3, {'weight': 1}), 

(3, 4, {'weight': 1}), (4, 0, {'weight': 0.1})] 

As a bonus, the NumPy way is significantly faster for large networks. Also, note 
how NumPy intelligently treated matrix elements as edge weights! 

You can program the reverse transformation with nx.to_numpy_matrix(G): 

B_mtx = nx.to_numpy_matrix(G) # Produces a NumPy 2D matrix
print(B_mtx)


< [[ 0.   1.   0.   0.   0. ]
   [ 0.   0.   1.   0.   0. ]
   [ 0.   0.   0.   1.   0. ]
   [ 0.   0.   0.   0.   1. ]
   [ 0.1  0.   0.   0.   0. ]]


To convert the matrix back to a list of lists, call method tolist():

B_lst = B_mtx.tolist() 
print(B_lst) 

< [[0.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0, 0.0],
   [0.0, 0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0, 1.0],
   [0.1, 0.0, 0.0, 0.0, 0.0]]

By the way, you may have just learned how to program with NumPy.
Adjacency Matrix, the Pandas Way

The most versatile connection to date is between NetworkX and Pandas. If you
consider integrating CNA with general data scientific methods (regression,
prediction, and classification), you must be aware of the interface between
NetworkX and Pandas.

Another Kind of Pandas

Pandas are cute. Pandas is also yet another component of the SciPy (“Scientific Python”)
ecosystem. Its main application is data science, and it provides a basketful of data
structures and algorithms for storing and processing labeled rectangular data. The
most famous Pandas data structures are a Series (a labeled vector) and a DataFrame (a
labeled table). You can read more about Pandas and NumPy in Data Science Essentials
in Python [Zin16].

Converting a NetworkX graph to a Pandas adjacency matrix costs one function
call, just like almost any other popular operation in Pandas. Before we do
so, let’s first relabel the graph nodes to allow at least some meaningful
interpretation:

labels = "Born", "Married", "Elected Rep", "Elected Pres", "Died"
nx.relabel_nodes(G, dict(enumerate(labels)), copy=False)
df = nx.to_pandas_dataframe(G)
print(df)
print(type(df))




<               Died  Elected Rep  Married  Born  Elected Pres
Died             0.0          0.0      0.0   0.1           0.0
Elected Rep      0.0          0.0      0.0   0.0           1.0
Married          0.0          1.0      0.0   0.0           0.0
Born             0.0          0.0      1.0   0.0           0.0
Elected Pres     1.0          0.0      0.0   0.0           0.0
<class 'pandas.core.frame.DataFrame'>




Discussing the uses of DataFrame objects is beyond the scope of this book; try
this adventure on your own!

Surprisingly, function nx.from_pandas_dataframe(df, source, target) is not a counterpart
of nx.to_pandas_dataframe() in the same sense as nx.from_numpy_matrix() is a
counterpart of nx.to_numpy_matrix(). The first parameter of nx.from_numpy_matrix() is a
matrix that represents the adjacency matrix. The first parameter of
nx.from_pandas_dataframe() is a DataFrame, but each row defines one edge, and
two of the columns, source and target, designate the start and end nodes of the
edge. (You can convert the remaining columns into edge attributes.) This
approach is more flexible than the adjacency matrix. For example, it easily
allows parallel edges.
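To see why a one-row-per-edge table tolerates parallel edges, consider this pure-Python sketch. The rows and the second "Married" to "Elected Rep" record are invented for the illustration, and group_parallel_edges is a hypothetical helper, not a Pandas or NetworkX call.

```python
from collections import defaultdict

# One record per edge; the third row is a made-up parallel edge
rows = [
    {"from": "Born", "to": "Married", "weight": 1.0},
    {"from": "Married", "to": "Elected Rep", "weight": 1.0},
    {"from": "Married", "to": "Elected Rep", "weight": 0.5},
]


def group_parallel_edges(rows, source="from", target="to"):
    """Map each (start, end) pair to the list of attribute dicts of its edges."""
    grouped = defaultdict(list)
    for row in rows:
        # Every column other than source and target becomes an edge attribute
        attrs = {k: v for k, v in row.items() if k not in (source, target)}
        grouped[(row[source], row[target])].append(attrs)
    return dict(grouped)


grouped = group_parallel_edges(rows)
for pair, attr_list in grouped.items():
    print(pair, attr_list)
```

An adjacency matrix has exactly one cell per node pair, so it could keep only one of the two "Married" to "Elected Rep" weights; the table keeps both.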

Let’s build Honest Abe’s lifetime network from a data frame. First, create a 
data frame—in the real world, it would be an output of another part of the 
same program. Second, call nx.from_pandas_dataframe(). No magic involved. 

import pandas as pd
df = pd.DataFrame({
    "from": {0: "Died", 1: "Elected Rep", 2: "Married", 3: "Born",
             4: "Elected Pres"},
    "to": {0: "Born", 1: "Elected Pres", 2: "Elected Rep", 3: "Married",
           4: "Died"},
    "weight": {0: 0.1, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0},
})
print(df)



<            from            to  weight
0            Died          Born     0.1
1     Elected Rep  Elected Pres     1.0
2         Married   Elected Rep     1.0
3            Born       Married     1.0
4    Elected Pres          Died     1.0


G = nx.from_pandas_dataframe(df, "from", "to", edge_attr=["weight"])
print(G.edges(data=True))

< [('Born', 'Married', {'weight': 1.0}), ('Born', 'Died', {'weight': 0.1}), 

('Married', 'Elected Rep', {'weight': 1.0}), 

('Elected Pres', 'Died', {'weight': 1.0}), 

('Elected Pres', 'Elected Rep', {'weight': 1.0})] 

By the way, you may have just learned how to program with Pandas.

Handling Node Attributes, the Pandas Way

Another case of collaboration between Pandas and NetworkX is importing node 
attributes into a DataFrame. In the course of CNA, you often decorate network 
nodes with various attributes: labels, weights, centralities, demographics
(age, gender), and the like. For the sake of experimentation, let's add a "date" 
parameter to Lincoln’s timeline: 

events = {"Died": 1865, "Born": 1809, "Elected Rep": 1847, 

"Elected Pres": 1861, "Married" : 1842} 
nx.set_node_attributes(G, "date", events) 
node_data = G.nodes(data=True)
print(node_data) 

< [('Died', {'date': 1865}), ('Elected Rep', {'date': 1847}), 

('Married', {'date': 1842}), ('Born', {'date': 1809}), 

('Elected Pres', {'date': 1861})] 
The values of node attributes, especially those calculated by the analysis
program, may become inputs to further data analysis steps, as mentioned in 
Adjacency Matrix, the Pandas Way, on page 73. If you’re a Pandas person, you 
should move the node attributes from NetworkX to a DataFrame. Luckily, one of 
the DataFrame constructors builds a DataFrame from a list of tuples, and node data 
is a list of tuples. 

lincoln_ser = pd.DataFrame(node_data).set_index(0)[1]
print(lincoln_ser)

< 0
Died            {'date': 1865}
Elected Rep     {'date': 1847}
Married         {'date': 1842}
Born            {'date': 1809}
Elected Pres    {'date': 1861}
Name: 1, dtype: object

After converting the node labels to the row index, the resulting DataFrame has
only one column named 1 (which, naturally, is a Series). The values in the
column are node attribute dictionaries, and one of the Series constructors
builds a Series from a dictionary. Let’s apply the constructor to each row.

df = lincoln_ser.apply(pd.Series) 
print(df ) 


<               date
0
Died            1865
Elected Rep     1847
Married         1842
Born            1809
Elected Pres    1861

The result is a DataFrame suitable for further processing. For example, you can
calculate the duration, in years, of each span of Lincoln’s biography: 


spans = df.sort_values('date').diff()
print(spans)

<               date
0
Born             NaN
Married         33.0
Elected Rep      5.0
Elected Pres    14.0
Died             4.0


(NaN, “not a number,” is a Pandas way of reporting a missing or otherwise 
unavailable value.) 
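If Pandas is not at hand, the same spans can be computed from the events dictionary alone. This is a plain-Python sketch of the sort-then-difference step; the variable names are illustrative, and None plays the role that NaN plays in Pandas.

```python
events = {"Died": 1865, "Born": 1809, "Elected Rep": 1847,
          "Elected Pres": 1861, "Married": 1842}

# Sort the events chronologically, then subtract each date from the next one
ordered = sorted(events.items(), key=lambda item: item[1])
spans = {}
previous = None
for name, year in ordered:
    # The earliest event has no predecessor, hence no span
    spans[name] = None if previous is None else year - previous
    previous = year

for name, span in spans.items():
    print(name, span)
```

The numbers agree with the Pandas output above: 33, 5, 14, and 4 years, with no value for the first event.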
Incidence Matrix

An incidence matrix J is a rectangular NxM matrix, where N is the number of
nodes and M is the number of edges. A 1 at J[i,j] means that the node i is
incident to the edge j. All other elements of J are 0s. If the represented graph is
directed, the start node is designated with 1 and the end node with -1.

Unlike an adjacency matrix, an incidence matrix easily allows parallel edges. 
However, it has its weak points: weighted networks cannot be represented, 
and an incidence matrix of a typical complex network has a larger memory
footprint than the adjacency matrix of the same network.

Function nx.incidence_matrix(G) returns the incidence matrix of G as a so-called
sparse matrix. (Pass the optional parameter oriented=True to distinguish start
and end nodes.) You can convert a sparse matrix to a dense one by calling its
todense() method:

J = nx.incidence_matrix(G, oriented=True).todense() 

print(J) 


< [[-1.  0.  0.  0.  1.]
   [ 1. -1.  0.  0.  0.]
   [ 0.  1. -1.  0.  0.]
   [ 0.  0.  1. -1.  0.]
   [ 0.  0.  0.  1. -1.]]


Here’s how we read the results: edge number 0 starts at node 1 (because
J[1,0]==1) and ends at node 0 (because J[0,0]==-1); edge number 1 starts at node
2 (because J[2,1]==1) and ends at node 1 (because J[1,1]==-1), and so on.
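To make the definition concrete, here is a pure-Python sketch that builds an oriented incidence matrix from an edge list. It follows the sign convention stated in this section (1 at the start node, -1 at the end node); other texts and libraries may use the opposite signs, so the start and end parameters are there to flip them. The function incidence_matrix below is a hypothetical helper, unrelated to nx.incidence_matrix().

```python
def incidence_matrix(nodes, edges, start=1, end=-1):
    """Return a len(nodes)-by-len(edges) list of lists; one column per edge."""
    index = {node: i for i, node in enumerate(nodes)}
    J = [[0] * len(edges) for _ in nodes]
    for e, (u, v) in enumerate(edges):
        J[index[u]][e] = start  # mark the node where edge e begins
        J[index[v]][e] = end    # mark the node where edge e ends
    return J


nodes = [0, 1, 2, 3, 4]
# The edges exactly as the text reads them off the printed matrix
edges = [(1, 0), (2, 1), (3, 2), (4, 3), (0, 4)]
J = incidence_matrix(nodes, edges)
for row in J:
    print(row)

# Parallel edges are not a problem: each one simply gets its own column
J_parallel = incidence_matrix([0, 1], [(0, 1), (0, 1)])
```

Under this reading, the sketch reproduces the matrix printed above, and every column sums to zero because each edge has exactly one start and one end.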

Work with Edge Lists and Node Dictionaries 

You do not have to mess with matrices, NumPy, and Pandas to bulk move data 
between your code and NetworkX networks. You can use edge lists and node 
dictionaries. 

Edge Lists 

An edge list is a list of 3-tuples containing the start node, end node, and a
dictionary of edge attributes for each edge. You can obtain it from an existing
network by calling nx.to_edgelist() or construct it yourself and feed it as the
parameter to nx.from_edgelist() to produce a new network.

edges = nx.to_edgelist(G) 

F = nx.from_edgelist(edges, create_using=G)
print(F.edges(data=True))
< [('Born', 'Married', {'weight': 1}), ('Married', 'Elected Rep',
   {'weight': 1}), ('Elected Pres', 'Died', {'weight': 1}),
   ('Died', 'Born', {'weight': 0.1}), ('Elected Rep', 'Elected Pres',
   {'weight': 1})]

Incidentally (or not), the value returned by G.edges(data=True) (on page 21) is
equivalent to the value returned by nx.to_edgelist(). That is to say, one of the
two functions is redundant.

The pair of the edge list-related functions is reversible: a graph A, created from
an edge list extracted from another graph B, is equal to B. Equality of graphs
in mathematical graph theory is called isomorphism. Two graphs are
isomorphic if you can align all of the nodes of one graph with all of the nodes of the
other graph so that all of their edges align, too. This property is good enough
for the graphs with unlabeled nodes but too weak for real-world labeled
graphs. Yet, for the lack of a better tool, let’s use the function nx.is_isomorphic():

print(nx.is_isomorphic(F, G))

< True 
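For labeled graphs, a stricter check is easy to sketch by comparing the edge lists themselves, labels and attributes included. The following pure-Python function assumes directed edges in the 3-tuple format shown above and node labels that can be sorted; same_labeled_graph is a hypothetical helper, not a NetworkX function, and it ignores isolated nodes.

```python
def same_labeled_graph(edges_a, edges_b):
    """True if both edge lists contain the same labeled, attributed edges."""
    def normalize(edges):
        # Sort edges and their attributes so that the order does not matter
        return sorted((u, v, sorted(attrs.items())) for u, v, attrs in edges)
    return normalize(edges_a) == normalize(edges_b)


abe = [("Born", "Married", {"weight": 1}),
       ("Married", "Elected Rep", {"weight": 1}),
       ("Elected Rep", "Elected Pres", {"weight": 1}),
       ("Elected Pres", "Died", {"weight": 1}),
       ("Died", "Born", {"weight": 0.1})]
shuffled = list(reversed(abe))
reweighted = abe[:-1] + [("Died", "Born", {"weight": 1})]

print(same_labeled_graph(abe, shuffled))
print(same_labeled_graph(abe, reweighted))
```

The two graphs in the second comparison have the same topology and would pass an isomorphism test, yet they differ in one edge weight.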

Dictionary of Lists 

A dictionary of lists of nodes is what it says it is. All nodes in a graph are the
keys, and lists of adjacent nodes are values. You can get a dictionary of lists
with nx.to_dict_of_lists():

dict_list = nx.to_dict_of_lists(G) 
print(dict_list) 

< {'Born': ['Married'], 'Married': ['Elected Rep'], 'Elected Pres': ['Died'], 

'Died': ['Born'], 'Elected Rep': ['Elected Pres']} 

nx.to_dict_of_lists() does not externalize edge attributes, including weight, and this
makes the resulting dictionary unsuitable for recreating the original graph
with nx.from_dict_of_lists(). It is true that the new graph is isomorphic to the
source, but the function nx.is_isomorphic() looks only at the topology of the graphs
and does not compare the attributes.

F = nx.from_dict_of_lists(dict_list, create_using=G)
nx.is_isomorphic(F, G)

< True

The dictionary-of-lists mechanism does not appear to be well thought out.
You may be better off with accessing the network edge dictionary (which is a
dictionary of dictionaries) directly:
print(G.edge)

< {'Born': {'Married': {'weight': 1}}, 'Married': {'Elected Rep':
   {'weight': 1}}, 'Elected Pres': {'Died': {'weight': 1}},
   'Died': {'Born': {'weight': 0.1}}, 'Elected Rep': {'Elected Pres':
   {'weight': 1}}}
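The difference between the two dictionary forms shows up as soon as you build them by hand. This pure-Python sketch assumes a directed edge list in the 3-tuple format used earlier; both helper names are invented for the illustration.

```python
def edges_to_dict_of_dicts(edges):
    """Nested dictionary: start node -> end node -> edge attributes."""
    adjacency = {}
    for u, v, attrs in edges:
        adjacency.setdefault(u, {})[v] = dict(attrs)
        adjacency.setdefault(v, {})  # make sure the end node is a key, too
    return adjacency


def strip_attributes(adjacency):
    """Dictionary of lists: keeps the neighbors, drops the attributes."""
    return {u: list(neighbors) for u, neighbors in adjacency.items()}


edges = [("Born", "Married", {"weight": 1}),
         ("Married", "Elected Rep", {"weight": 1}),
         ("Elected Rep", "Elected Pres", {"weight": 1}),
         ("Elected Pres", "Died", {"weight": 1}),
         ("Died", "Born", {"weight": 0.1})]
dod = edges_to_dict_of_dicts(edges)
dol = strip_attributes(dod)
print(dod["Died"])
print(dol["Died"])
```

The nested form still knows that the death-to-birth edge weighs 0.1; the list form remembers only that the edge exists.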

Generate Synthetic Networks 

You have read in Appreciate Synthetic Networks, on page 63, that not only
can networks be built from experimental, real-world data, but they can also
be synthesized. Synthetic networks can be regular (constructed by executing
deterministic algorithms) or complex (emerge from probability distributions).
NetworkX functions that build synthetic network graphs are called graph
generators (not to be confused with Python generator objects).

In 1999, Ronald C. Read and Robin J. Wilson published a collection of 10,000
small synthetic graphs [RW99]. NetworkX provides generators for 1,253 of them,
and about 110 more regular (“classic”) and complex networks. At the
moment, it suffices to look only at the graph generators whose output is
shown in the figures on page 63, on page 64, and on page 65. Let’s start with
the “classic” networks: paths, cycles, stars, complete graphs, trees, and grids.

generators.py 

# Generate and draw classic networks
G0 = nx.path_graph(20)
G1 = nx.cycle_graph(20)
G4 = nx.star_graph(20)
G5 = nx.complete_graph(20)
G2 = nx.balanced_tree(2, 5)
G3 = nx.grid_2d_graph(5, 4)
names = ("Linear (Path)", "Ring (Cycle)", "Balanced Tree", "Mesh (Grid)",
         "Star", "Complete")

The first four functions need to know the total number of nodes. There is only
one way to generate the edges for these types of graphs. For a balanced tree,
you must provide the branching factor r (the number of children of a non-leaf
node) and the height h (the height does not include the root node of the tree).
A balanced tree has (r^(h+1) - 1)/(r - 1) nodes. In our example, G2 is a five-level binary tree
with 2^(5+1) - 1 = 63 nodes. To build a two-dimensional grid (mesh) like G3, specify
the number of rows n and columns m, and get a graph with m×n nodes.
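The node counts are worth double-checking before you generate a large graph. Below is a small arithmetic sketch; it counts the tree level by level, which for r > 1 equals the closed form (r^(h+1) - 1)/(r - 1). The helper names are made up and do not come from NetworkX.

```python
def balanced_tree_size(r, h):
    """Number of nodes in a balanced tree: r**level nodes on each level 0..h."""
    return sum(r ** level for level in range(h + 1))


def grid_size(n, m):
    """Number of nodes in a two-dimensional grid with n rows and m columns."""
    return n * m


print(balanced_tree_size(2, 5))  # the tree G2 from the listing above
print(grid_size(5, 4))           # the grid G3 from the listing above
```

Summing over levels avoids a division by zero for r = 1, where the "tree" degenerates into a path with h + 1 nodes.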

The next example shows the use of generators for Erdős-Rényi (really random),
Watts-Strogatz (small world), Barabási-Albert (preferential attachment), and
Holme-Kim (enhanced preferential attachment) random graphs.
generators.py

# Generate and draw random networks
G0 = nx.erdos_renyi_graph(50, 0.05)
G1 = nx.connected_watts_strogatz_graph(50, 4, 0.5)
G2 = nx.barabasi_albert_graph(50, 4)
G3 = nx.powerlaw_cluster_graph(50, 4, 0.5)
names = ("Erdős-Rényi (p=0.05)", "Watts-Strogatz (k=4, p=0.5)",
         "Barabási-Albert (k=4)", "Holme-Kim (k=4, p=0.5)")

All four functions need to know the total graph size. The remaining parameters
characterize the random nature of the interconnecting edges:

• For Erdős-Rényi: the probability of edge creation. Incidentally, it equals
the graph density (Start with Global Measures, on page 83).

• For Watts-Strogatz: the initial number of neighbors and the probability
of edge rewiring

• For Barabási-Albert: the number of edges to attach from a new node

• For Holme-Kim: the same as above, plus the probability of adding a triangle
for each added edge
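The simplest of the four models fits in a few lines of standard-library Python. The sketch below is an independent illustration of the Erdős-Rényi idea for an undirected graph, not the NetworkX implementation; the function names and the seed parameter are assumptions made for this example.

```python
import random
from itertools import combinations


def erdos_renyi_edges(n, p, seed=None):
    """Keep each of the n*(n-1)/2 possible undirected edges with probability p."""
    rng = random.Random(seed)  # a private generator makes runs reproducible
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def undirected_density(n, m):
    """Fraction of possible undirected edges that are present."""
    return 2 * m / (n * (n - 1))


edges = erdos_renyi_edges(50, 0.05, seed=1)
print(len(edges), undirected_density(50, len(edges)))
```

Because every potential edge is kept independently with probability p, the density of any single generated graph fluctuates around p rather than matching it exactly.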

The remaining three generators produce “famous” social networks that were 
initially constructed by field sociologists based on experimental data, but 
eventually became “gold standards” of social network research. 

generators.py 

# Generate and draw famous social networks 
G0 = nx.karate_club_graph() 

G1 = nx.davis_southern_women_graph() 

G2 = nx.florentine_families_graph() 

names = ("Zachary's Karate Club", "Davis Southern women",
         "Florentine families")

The networks, though formerly random, are stored by NetworkX in the form of
fixed edge lists. Their generators need no parameters.

Slice Weighted Networks 

Lucky network analysts work with unweighted networks. In an unweighted 
network, all edges are equal. You consider either all of them, and get what 
you get—or none of them, and get a network with no edges. 

Unlucky network analysts work with weighted (and possibly signed) networks.
In a weighted network, some edges are strong, and some are weak. If you
keep all edges, you will have a distorted view of the network because there
are algorithms that do not discriminate edges by weight. For them, an edge
with a weight of 1.00 (to your best life-long friend) has the same importance
as another edge with a weight of 0.01 (to the guy who takes the same 7:00 a.m.
bus, always sits in the back, and reads Alaska Dispatch News).

Most network analysts are unlucky and have to slice their networks.

Slicing is the process of eliminating low-strength edges (weak ties). In the
simplest form, you choose a cut-off threshold T that controls the density of
the resulting network. Each edge’s weight is compared to the threshold. If
the weight is at or above the threshold, the edge remains in the network;
otherwise, it is erased.

NetworkX does not provide a standard slicing routine, but you can quickly
implement yours (we will do so later). However, first, you should decide on the value
of T. If the cut-off is too high, the network falls apart into tiny disjoint
fragments; if it is too low, the network becomes a hairball with no analyzable
structure. The trial-and-error approach may be the best:

1. Select a T based, say, on the edge weight distribution.

2. Slice the network.

3. Get some measurements (the number of fragments, density, and so on). 

4. If the results do not suit you, go back to square one. 
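The four steps can be sketched as a loop over candidate thresholds. The following pure-Python example assumes an undirected network given as a list of (node, node, weight) triples; the toy data and all three function names are invented for this illustration.

```python
def slice_edges(edges, T):
    """Keep only the edges whose weight is at or above the threshold T."""
    return [(u, v, w) for u, v, w in edges if w >= T]


def count_fragments(nodes, edges):
    """Number of connected components, found with a simple union-find."""
    parent = {node: node for node in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v, _ in edges:
        parent[find(u)] = find(v)
    return len({find(node) for node in nodes})


def survey_thresholds(nodes, edges, thresholds):
    """For each T, report (T, edge count, fragment count, density)."""
    n = len(nodes)
    report = []
    for T in thresholds:
        kept = slice_edges(edges, T)
        density = 2 * len(kept) / (n * (n - 1))
        report.append((T, len(kept), count_fragments(nodes, kept), density))
    return report


nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 1.0), ("b", "c", 0.6), ("c", "d", 0.2), ("a", "c", 0.05)]
report = survey_thresholds(nodes, edges, [0.0, 0.1, 0.5, 0.9])
for line in report:
    print(line)
```

As T grows, the edge count and the density drop while the number of fragments rises; a reasonable T sits somewhere before the network falls apart.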

The following figure shows the same Erdős-Rényi random network with one
hundred nodes, sliced with six different thresholds.
The final value of T depends on the network topology and weight distribution.

Our new function, slice_network(G, T, copy=True), retrieves all weighted edges from
the network G, identifies “underweighted” edges, and removes them. The
function by default operates on the copy of G, allowing us to step back and
give another try without rebuilding G.

slicing.py 

def slice_network(G, T, copy=True):
    """
    Remove all edges with weight<T from G or its copy.
    """
    F = G.copy() if copy else G
    # Collect the edges in a list first, so that F is not modified
    # while its edges are being iterated over
    F.remove_edges_from([(n1, n2) for n1, n2, w in F.edges(data="weight")
                         if w < T])
    return F

F = slice_network(G, 0.9)
print(F.edges())

< [('Elected Rep', 'Elected Pres'), ('Married', 'Elected Rep'), 

('Born', 'Married'), ('Elected Pres', 'Died')] 

The possibility of reincarnation is no more.

Now you know how to create a complex network of any size from any dataset 
and how to convert the network structure to the most popular pure Python and 
third-party data structures. We will next look into network measuring tools. 
Considering the value for clearness of thought of counting, measuring and
weighing, it is not surprising to find that in the seventeenth century, and
even at the end of the sixteenth, the advance of the sciences was
accompanied by increased exactness of measurement and by the invention of
instruments of precision.

Walter Libby, American writer

CHAPTER 8

Measuring Networks

Almost everything you have seen so far in this book has been about
constructing complex networks, not about analyzing them. In other words, it was CN,
but not CNA. This chapter delves into CNA and introduces some important
CNA toolsets. You will learn how to measure dyadic, triadic, and global
properties of network nodes: distances, loops, clustering coefficient,
assortativity, and a variety of centralities. You will be able to identify the most central
nodes and interpret their importance. You will be able to locate network
regions that differ in local density and attribute uniformity.

Start with Global Measures 

Let’s start with a “black box” view of a complex network. Let’s pretend we are 
at a distance and instead of nodes, edges, and their attributes, we see a fuzzy 
grayish cloud. What can we tell about that cloud? Not much: only its size and
density.

To be specific, in this chapter, we will experiment with the network of
CNA-related Wikipedia pages constructed in Chapter 5, Case Study: Constructing
a Network of Wikipedia Pages, on page 41 and available in the file cna.graphml (footnote 1).

The size of a network is either its node count or edge count. You can measure
both using the standard Python len() function and other specialized functions.

len(G) # Number of nodes
len(G.node)
len(G.nodes())
nx.number_of_nodes(G)
len(G.edge)

< 2988


1. pragprog.com/titles/dzcnapy/source_code

len(G.edges()) # Number of edges
nx.number_of_edges(G)

< 11545

Note that len(G.edge) returns the number of nodes, not edges, because G.edge
is a dictionary of neighbors with one entry per node.

If you’re curious, you can also get the number of non-existent edges, that is, the
edges that could connect two nodes but don't. The function nx.non_edges()
returns a Python generator of missing edges. Before measuring it, you must
convert it to a list. Beware: most real-life graphs have orders of magnitude
more missing edges than present edges. Your computer may quickly run out
of memory if you attempt to make a list out of them.

len(list(nx.non_edges(G))) 

< 8913611 

Graph density measures the fraction of existing edges out of all potentially
possible edges. Density is a number between 0 and 1, inclusive. A network
with density 0 has no edges whatsoever. A network with density 1 is a complete
graph. For a directed network with n nodes and m edges, density is calculated
as m/(n(n-1)); for undirected networks, it is calculated as 2m/(n(n-1)), because,
compared to directed networks, they have only half of potentially possible
edges. You can measure density by calling a namesake function.

nx.density(G) 

< 0.0012935348132850563 

The density of the Wikipedia network is low—only about 0.1 percent. Only 
one out of about 1,000 possible edges exists in the graph. This value is not 
unusual: most complex networks have similarly low density. 
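Both numbers reported above follow from the node and edge counts alone, so you can verify them without building any list. This is an arithmetic sketch using the formulas from this section; the function names are illustrative and are not NetworkX calls.

```python
def density(n, m, directed=True):
    """m existing edges out of all edges possible among n nodes."""
    possible = n * (n - 1)
    if not directed:
        possible //= 2  # an undirected graph has half as many possible edges
    return m / possible


def count_non_edges(n, m, directed=True):
    """Number of node pairs that could be connected but are not."""
    possible = n * (n - 1)
    if not directed:
        possible //= 2
    return possible - m


n, m = 2988, 11545  # the Wikipedia network measured above
print(density(n, m))
print(count_non_edges(n, m))
```

For the directed Wikipedia network, the results agree with the values returned by nx.density() and by counting the output of nx.non_edges(), at a negligible memory cost.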

Explore Neighborhoods 

Node and edge counts and density are some of the macroscopic network 
properties. Let us now zoom into a network and look at it at the microscopic 
level—at the level of individual nodes and their neighbors. 

The network neighborhood of a node is the set of all nodes adjacent to that
node. Social network analysis pays particular attention to neighborhoods
because that is where we find the relatives, close friends, and colleagues of
the actor represented by the central node, in other words, the most socially
significant alters of the ego. (Check Egocentric Networks, on page 54, to refresh
the meaning of the emphasized words.) Neighborhoods are responsible for
the local properties of network graphs.

NetworkX offers two mechanisms for calculating neighborhoods. (To be specific,
let’s compute the neighborhood of the node ego="Neighbourhood (Graph Theory)".)

• Use the implicit dictionary representation of the graph. Node names are
keys, and adjacent node dictionaries are values.

alters1 = G[ego]
print(alters1)
print(len(alters1))

< {'Turan Graph': {}, 'Isolated Vertex': {}, 'Adjacency List': {}, 

'Graph (Discrete Mathematics)': {}, 'Complement Graph': {}, 

'Journal Of The Acm': {}, 'Triangle Free Graph': {}, 'Dense Graph': {}, 
'Vertex (Graph Theory)': {}, 'Loop (Graph Theory)': {}, 

'Linear Time': {}, 'Planar Graph': {}, 'Vertex Figure': {}, 

« . . .» 

'Independent Set (Graph Theory)': {}, 'Claw Free Graph': {}, 

'Discrete Mathematics (Journal)': {}, 'Cycle Graph': {}} 

35 

The empty dictionaries {} would hold edge attributes if the network had 
edge attributes. 

• Call the function nx.all_neighbors(). The function returns a generator object
that you can convert to a list. However, if you expect a node to have too
many neighbors and you do not need all of them at once, keep the
neighborhood in the generator form until later.

alters2 = list(nx.all_neighbors(G, ego)) 

print(alters2) 

print(len(alters2) ) 

< ['Watts And Strogatz Model', 'Network Science', 'Spatial Network', 

'Scientific Collaboration Network', 'Semantic Network', 

'Barabasi-Albert Model', 'Reciprocity (Network Science)', 

'Biological Network', 'Clustering Coefficient', 'Pavol Hell', 

«. . .» 

'Graph Isomorphism', 'Modular Decomposition', 'Planar Graph', 

'Vertex Figure', 'Independent Set (Graph Theory)', 'Cycle Graph', 
'Discrete Mathematics (Journal)', 'Claw Free Graph'] 

65 

Neither neighborhood contains the ego node itself. Such neighborhoods are 
called “open.” The figure on page 86 shows the out-neighborhood of ego: yet 
another star. Don’t forget that the gray rectangles are Matplotlib’s idea of arrows. 
Note that the two methods report a different number of nodes in the
neighborhoods alters1 and alters2: 35 and 65, respectively. Recall that the network G is
directed. The first method returns only the neighbors reachable by the outgoing
edges: the out-neighborhood. The second method returns all adjacent nodes,
regardless of the direction of adjacency. Which method to use depends on
which result you’re looking for.
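The distinction is easy to reproduce on a toy directed network. The following pure-Python sketch works on a plain list of (start, end) pairs; the data and both function names are made up for the illustration and only mimic what G[ego] and nx.all_neighbors() report.

```python
def out_neighbors(edges, ego):
    """Nodes reachable from ego by one outgoing edge."""
    return {v for u, v in edges if u == ego}


def all_neighbors(edges, ego):
    """Nodes adjacent to ego, regardless of the edge direction."""
    incoming = {u for u, v in edges if v == ego}
    return out_neighbors(edges, ego) | incoming


# A tiny directed network: ego points to two nodes, two nodes point to ego
edges = [("ego", "a"), ("ego", "b"), ("c", "ego"), ("a", "ego"), ("c", "d")]
print(sorted(out_neighbors(edges, "ego")))
print(sorted(all_neighbors(edges, "ego")))
```

Node "c" links to the ego but is not linked from it, so it appears only in the second neighborhood, which is exactly why the two counts differ.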

A neighborhood is a dyadic structure. It’s defined in terms of connections
between two nodes: the ego and an alter. Aside from serving as a reference
to the ego’s inner circle, it conveys little information. For example, it doesn’t
tell if and how its members are interconnected. Adding the chord edges
transforms the sparse neighborhood into an egocentric network (Egocentric
Networks, on page 54). Call function nx.ego_graph() to obtain the egocentric
network graph.

egonet = nx.ego_graph(G, ego) 

The figure on page 87 shows the egocentric network of ego. The network is 
much denser. You can see that some nodes are connected only to the hub 
(and possibly to some more remote nodes), while others form triangles that 
involve more neighborhood members.

Clustering Coefficient

Some social theories consider triads essential units of social network analysis.
Function nx.clustering(G, nodes=None) calculates the clustering coefficient, a
measure of the prevalence of triangles in an egocentric network. The clustering
coefficient is the fraction of possible triangles that contain the ego node and
exist. This measure is undefined for directed graphs; you must coerce a
digraph to an undirected graph before calculating the clustering coefficient.
The following code fragment shows how to call the function:

cc = nx.clustering(nx.Graph(G), ego) 
print(cc) 

< 0.36251920122887865 

If the clustering coefficient of a node is 1, the node participates in every
possible triangle involving any pair of its neighbors; the egocentric network
of such a node is a complete graph. If the clustering coefficient of a node is
0, no two nodes in the neighborhood are connected; the egocentric network
of such a node is a star. Think of the clustering coefficient as a measure of
“stardom.”
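
The two extreme cases are easy to reproduce. The sketch below uses two synthetic graphs from the NetworkX generators (a star and a complete graph), chosen only for illustration.

```python
import networkx as nx

star = nx.star_graph(5)        # node 0 is the hub; nodes 1..5 are the leaves
clique = nx.complete_graph(6)  # every pair of the six nodes is connected

cc_star = nx.clustering(star, 0)      # no two neighbors of the hub are connected
cc_clique = nx.clustering(clique, 0)  # every pair of neighbors is connected

print(cc_star, cc_clique)
```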
Cluster Clusteri Lupus Est

Just like some other terms, the term “clustering” refers to at least
three different concepts: the separation of a network into compact,
tightly knit communities (Outline Modularity-Based Communities,
on page 136); the measure of the density of an egocentric network;
and the task of grouping relational data objects into subsets with
similar properties.


Function nx.average_clustering() calculates the mean clustering coefficient for all
nodes of a simple network (no loops, no directed or parallel edges).

acc = nx.average_clustering(nx.Graph(G))
print(acc)


< 0.7266398872539529 


The average clustering coefficient is not to be confused with the clustering
coefficient of the whole network, which is the fraction of all possible triangles that
exist in the network. The latter is known as transitivity, a measure of transitive
closure (explained on page 5). NetworkX has a namesake function to calculate
it, too:

trans = nx.transitivity(G) 
print(trans) 

< 0.03412721874374035 


You can see the discrepancy between the two alternative measures of the 
“stardom.” The source of the discrepancy is a considerable proportion of 
nodes with few neighbors. For such nodes, the local clustering coefficient 
is traditionally high, as shown in the figure on page 89, and it inflates the 
mean value. 
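
A synthetic example makes the inflation visible. The sketch below builds a hypothetical "windmill" of ten triangles that share one hub node: the twenty low-degree nodes each have a local clustering coefficient of 1, while the hub's is only 1/19.

```python
import networkx as nx

# Ten triangles sharing one hub (hypothetical data for illustration)
G = nx.Graph()
for i in range(10):
    a, b = f"a{i}", f"b{i}"
    G.add_edges_from([("hub", a), ("hub", b), (a, b)])

acc = nx.average_clustering(G)  # mean of the local coefficients
trans = nx.transitivity(G)      # 3 * triangles / connected triples

print(round(acc, 4), round(trans, 4))
```

The mean is dominated by the twenty degree-2 nodes and comes out close to 1, whereas the transitivity is 3/21 because the hub contributes most of the connected triples.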

By the way, the figure presents the results of a real, though so far concealed,
exploratory complex network experiment. You will see the mechanics of
similar experiments in Choose the Right Centralities, on page 92.

Think in Terms of Paths 


Both dyadic and triadic relationships are local and never go farther than one 
edge (or one “hop,” as network researchers say) from any of the involved 
nodes. The purpose of the functions in this section is to take you far away— 
as far as your network can afford. 

For that, you need definitions of a walk, trail, and path.


• A walk in a network is any sequence of edges such that the end of one
edge is always the beginning of another edge, except possibly for the first
and last edges that may be connected only at one end.

• A trail is a walk that never uses the same edge twice. A trail that does not
intersect itself, but starts and ends at the same node, is called a cycle (a
self-loop edge explained on page 18 is a cycle).

• A path is a trail that never visits the same node twice (in other words, it
does not intersect itself; NetworkX refers to paths as “simple paths”).

Any of these walks is directed if any of its constituent edges is directed. For 
the rest of the book, we will use only paths. 

A path has a length. The length of a path in an unweighted network is the
number of edges in the path. When it comes to weighted paths, it is up to 
you to decide how to calculate the length. Possible metrics include the number 
of edges, the sum of the weights, the harmonic average of the weights, and 
the largest or the smallest weight. 
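
The candidate metrics are straightforward to compute once you have the weights along the path. The following sketch assumes a hypothetical three-edge path whose edges carry a "weight" attribute; the graph and the numbers are invented for illustration.

```python
from statistics import harmonic_mean

import networkx as nx

# A hypothetical weighted path a-b-c-d
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 2.0), ("b", "c", 4.0), ("c", "d", 4.0)])
path = ["a", "b", "c", "d"]

# Collect the weights of the consecutive edges of the path
weights = [G[u][v]["weight"] for u, v in zip(path, path[1:])]

lengths = {
    "edges": len(weights),
    "sum": sum(weights),
    "harmonic": harmonic_mean(weights),
    "largest": max(weights),
    "smallest": min(weights),
}
print(lengths)
```

Which of the five numbers is "the" length depends on what the weights mean in your network (distance, capacity, strength of a tie, and so on).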

A path is a highway across the network that indirectly connects the nodes
that are not adjacent to each other, and allows them to interact. The meaning 
of the remote interaction is specific to each class of complex networks. In a 
social network, a path of length 3 connects the ego to the friend-of-a-friend- 
of-a-friend. In the network of Wikipedia pages, a path of length 3 connects 
the page about neighborhoods to a page that is similar to a similar page. In 
a transportation network, a node three hops away is the destination you can 
get to by changing trains twice. In some networks (like the network of foods 
and nutrients), long paths make no sense at all. 

Two nodes in a network are often connected with more than one path. If paths 
matter at all, the shortest of them matters the most. (Again, there may be 
more than one shortest path between two nodes.) The shortest paths are 
called geodesics. 

Not only does NetworkX provide a set of tools for computing with paths, but it
also uses them for component detection and centrality calculation, to name
a few applications. Function nx.shortest_simple_paths(G,u,v) returns a generator of
all simple paths between the nodes u and v, ordered from the shortest to the
longest. You can expand the generator into a list, but beware: it may take the
program hours and even days to elicit all paths in a large graph. Use this
function with care! For example, you can get one path at a time by calling next().

path_gen = nx.shortest_simple_paths(G, ego, "Agent Based Model")
next(path_gen)

< ['Neighbourhood (Graph Theory)', 'Clustering Coefficient',
'Social Network', 'Agent Based Model']

next(path_gen)

< ['Neighbourhood (Graph Theory)', 'Edge (Graph Theory)',
'Small World Network', 'Agent Based Model']

next(path_gen)

< ['Neighbourhood (Graph Theory)', 'Vertex (Graph Theory)',
'Semantic Network', 'Agent Based Model']

Function nx.shortest_path(G,source=None,target=None) returns only one of the shortest
paths between source and target, but if you omit either or both of the parameters,
it returns either all shortest paths starting at source, all shortest paths ending
at target, or all shortest paths in the network.

path = nx.shortest_path(G, ego, "Agent Based Model")

< ['Neighbourhood (Graph Theory)', 'Clustering Coefficient',
'Social Network', 'Agent Based Model']

Networks as Circles

Reportedly, people used to believe that the Earth was flat and round, but
later changed their mind and settled on the geoid, an ellipsoid-like but not
exactly ellipsoidal body. We are going the opposite way: our complex networks
started as shapeless clouds, but at this point let’s try to treat them as flat
and round.

CNA offers a concept of node eccentricity: a measure of how far from (or close
to) the center a node is, wherever the center is. The eccentricity is the
maximum distance from a node to all other nodes in the network. The distance
between two nodes is naturally defined as the length of the geodesic between
the two nodes. Function nx.eccentricity(G,v=None) returns the eccentricity for one
node v or the whole graph. Note that in a directed graph, there may be no
directed geodesics for some pairs of nodes. You must decide if it is appropriate
to coerce the digraph to an undirected graph.

ecc = nx.eccentricity(nx.Graph(G)) 
print(ecc[ego])

< 3 

The remaining “circular” network properties are defined through the
eccentricity. If you already calculated it, do not throw it away, but pass it to the
following functions for the sake of performance.

• The diameter of a network is the maximum eccentricity. If two nodes are 
as far apart as possible, they must be at the diametrically opposite ends 
of the network, right? 

• The radius of a network is the minimum eccentricity. This definition is
not intuitive, but it is what it is. What’s more counterintuitive, in general,
is that the radius is not a half of the diameter.

• The center of a network is a set of all nodes whose eccentricity equals the
radius. This is another not very intuitive definition, but it yields a surprisingly
accurate result (see the following example).

• The periphery of a network is a set of all nodes whose eccentricity equals
the diameter. The set of peripheral nodes in a complex network is usually
large.

In the following examples, all circular measures are calculated based on the
precomputed eccentricity. There is no need to transform the digraph into an
undirected graph anymore.

print(nx.diameter(G, ecc))

< 4

print(nx.radius(G, ecc))

< 2

nx.center(G, e=ecc)
# Bingo!

< ['Complex Network']

nx.periphery(G, e=ecc)

< ['Glossary Of Areas Of Mathematics', 'Nutrition', 'Domestic Technology',
«...», # 2,869 nodes!
'Pierre Bourdieu', 'Sociology Of Law', 'Network Scheduler']
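
If you want to check the definitions by hand, try them on a graph small enough to draw on a napkin. The sketch below uses a four-node path graph (a synthetic example, not the Wikipedia network); it also shows a case where the radius is not half of the diameter.

```python
import networkx as nx

G = nx.path_graph(4)     # 0 - 1 - 2 - 3
ecc = nx.eccentricity(G) # computed once, reused by the calls below

diameter = nx.diameter(G, e=ecc)
radius = nx.radius(G, e=ecc)
center = sorted(nx.center(G, e=ecc))
periphery = sorted(nx.periphery(G, e=ecc))

print(ecc)
print(diameter, radius, center, periphery)
```

The end nodes are three hops away from each other (the diameter), while the two middle nodes are at most two hops away from anything (the radius).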

The eccentricity is a special case of path-based centralities: measures that
discriminate nodes by their position in the network. Centralities are quintessential
for social network analysis and most types of CNA in general.

Choose the Right Centralities 

This section uses Matplotlib, Pandas, NumPy.

One of the goals of social network analysis is to identify actors with
outstanding properties: the most influential, the most efficient, the most
irreplaceable (in other words, the most important). CNA, in general, is also
looking at the most important nodes: key products in product networks; key
words in semantic networks; key events in the networks of events, and the
like. One of the central premises of CNA is that the importance of a node
depends on the structural position of the node in the network and can be
calculated from neighborhoods, geodesics, or some other structural elements.
Let’s go over some of the most common centrality measures, without going
deep into the theory. (If you’re a curious reader, treat yourself to Social and
Economic Networks [Jac08], which covers the theory of centralities and many
other CNA topics!)

Degree Centrality

The simplest centrality measure is a node degree (also indegree and outdegree,
whenever necessary). Intuitively, a node with more edges, representing, say,
an actor with more ties, is more important than a node with only one edge.
Degree centrality is local and depends only on the node neighborhood. You
saw this centrality in disguise in Truncate the Network, on page 46.

You may argue that the node with the largest degree in a small network may
have fewer edges than the node with the smallest degree in a huge network. To
level the playing field, divide the number of edges by the maximal possible
number of incident edges. Remember that in a network with a simple graph (no
loops, no parallel edges), a node can have at most len(G)-1 neighbors. The redefined
degree centrality is always in the range from 0 (the node has no neighbors) to
1 (the node is the hub of the global star). The normalization makes it possible
to compare nodes from different networks. The subject with the highest degree
centrality in our Wikipedia network is Computer Network (0.227988).

A node with a high degree centrality may be capable of affecting a lot of
neighbors in its neighborhood at once, but we cannot say anything about the
opportunities for global outreach.

Closeness and Harmonic Closeness Centrality 

The closeness centrality is defined as the reciprocal mean distance (length of
the geodesics) from a node to all other reachable nodes in the network. It
shows how close the node is to the rest of the graph. This centrality is also 
in the range from 0 (the node has no neighbors; it is severed from the rest of 
the network) to 1 (the node is the hub of the global star and is one hop away 
from any other node). 

Another way to quantify the sense of closeness is to look at the mean reciprocal
distance (as opposed to the reciprocal mean distance; the order of the sum
and reciprocal operations reverses). Such a measure is called harmonic
centrality. Regrettably, the NetworkX function for calculating harmonic centrality does
not normalize the result. Make sure you divide it by len(G)-1 to obtain
comparable measures.
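
As a quick illustration of the normalization, the sketch below compares the two closeness measures on a synthetic five-node star. The division by len(G)-1 is done by hand.

```python
import networkx as nx

G = nx.star_graph(4)  # hub 0 and leaves 1..4

clo = nx.closeness_centrality(G)
har = nx.harmonic_centrality(G)  # raw sums of reciprocal distances
har_norm = {n: h / (len(G) - 1) for n, h in har.items()}

print(clo[0], har_norm[0])  # the hub
print(clo[1], har_norm[1])  # a leaf
```

For the hub, both measures equal 1. For a leaf, the closeness is 4/7 and the normalized harmonic closeness is 2.5/4, so the two measures agree at the extremes but differ in between.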

When the closeness of a node is equal to 0 or 1, the harmonic closeness of 
the same node is 0 or 1, too. However, the two centralities in general differ 
and in the case of our Wikipedia network are not even strongly correlated. 
The subject with the highest closeness centrality is Computer Network
(0.517678); Graph (Discrete Mathematics) performs best in terms of harmonic
closeness (0.027257).

A node with a high closeness centrality may be capable of affecting the entire
network, but how about controlling it?

Betweenness Centrality 

The betweenness centrality is for control freaks. It measures the fraction of
all possible geodesics that pass through a node. If the betweenness is high,
the node is potentially a crucial go-between (thus the name) and has a
brokerage capability. The removal of such a node would disrupt communications
in communication networks, lengthen geodesics, lower closeness centralities,
and possibly split the network into disconnected components.

The subject with the highest betweenness centrality in our Wikipedia network
is Computer Science (0.005515). 

You can often find high-betweenness nodes in the vicinity of bridges. A 
bridge is an edge whose removal would disconnect the network or signifi- 
cantly increase the length of the geodesics. The latter kind of bridge is called 
a local bridge. Pure bridges are rare in complex networks, but local bridges 
are not. 
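
Here is a minimal sketch of the relationship between bridges and betweenness. It uses a hypothetical network of two triangles joined by a single edge and assumes a NetworkX release that provides nx.bridges().

```python
import networkx as nx

# Two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)])

bet = nx.betweenness_centrality(G)
bridges = list(nx.bridges(G))

# Nodes that sit on at least one geodesic between two other nodes
brokers = sorted(n for n, b in bet.items() if b > 0)

print(bridges)
print(brokers, bet[2], bet[3])
```

The only bridge is the edge between nodes 2 and 3, and these two nodes are also the only ones with nonzero betweenness.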

Eigenvector Centrality 

Unlike the previously introduced centrality measures that rely on the
neighborhoods and geodesics to calculate the importance, the eigenvector
centrality uses a recursive definition of it: “Tell me who your friends are, and I will
tell you who you are.” (Incidentally, the saying can be traced back to Proverbs
13:20: “He that walketh with wise men shall be wise: but a companion of
fools shall be destroyed.”) Mathematically, the eigenvector centrality of a node
is the sum of the neighbors’ eigenvector centralities divided by λ, the largest
eigenvalue of the adjacency matrix of the network.
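
You can verify the recursive definition numerically. The sketch below does not call the NetworkX centrality function; instead, it takes the leading eigenvector of the adjacency matrix with NumPy (my choice for illustration) and checks that each entry equals the sum of the neighbors' entries divided by λ. The four-node graph is hypothetical.

```python
import networkx as nx
import numpy as np

# A triangle (0, 1, 2) with a pendant node 3 attached to node 2
G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])
nodes = sorted(G)
A = nx.to_numpy_array(G, nodelist=nodes)

# eigh() returns the eigenvalues of a symmetric matrix in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(A)
lam = eigenvalues[-1]
x = np.abs(eigenvectors[:, -1])  # the leading eigenvector, made positive

neighbor_sums = A @ x            # sum of the neighbors' centralities
consistent = np.allclose(neighbor_sums / lam, x)
top = nodes[int(np.argmax(x))]

print(consistent, top)
```

Node 2 comes out on top: it has the most neighbors, and two of them are well connected themselves.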

High eigenvector centrality identifies nodes that are surrounded by other 
nodes with high eigenvector centrality. You can use this measure to locate 
groups of interconnected nodes with high prestige. 

The subject with the highest eigenvector centrality in our Wikipedia network 
is Graph (Discrete Mathematics) (0.183307). 

PageRank 

At least two more types of centralities are based on recursive principles similar
to the eigenvector centrality: PageRank and HITS [PBMW99].

PageRank was developed by Google (and named after Google’s Larry Page) to
rank web pages. The web pages are represented by nodes in a directed graph.
The graph edges correspond to hyperlinks. The rank of a node (and the
corresponding page) in the network is calculated as the probability that a person
randomly traversing the edges (clicking on links) will arrive at the node (page).
The algorithm is parametrized by the damping factor alpha=0.85, which is the
probability that the user will continue clicking. The page with the highest
PageRank is the most attractive: no matter where the person starts, this page
is the most likely final destination.

Markov Page

If you’re familiar with Markov chains, you may have noticed that
PageRank with alpha=1 treats the network as a Markov chain and
calculates its stationary distribution. The damping factor introduces
an element of realism in the computation: the Web is
dynamic and does not have a stationary state.


PageRank thrives on the concept of link traversing and makes sense only in
directed networks. If you pass an undirected graph to nx.pagerank(), the function
will first convert it into a directed graph by replacing each undirected edge
with a pair of directed edges. The subject with the highest PageRank in our
Wikipedia network is Graph (Discrete Mathematics) (0.000836).
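
To get a feel for the random-surfer idea, here is a pure-Python sketch of the power iteration. It is not the NetworkX implementation: the function pagerank_sketch(), the fixed number of iterations, and the four-page link table are all assumptions made for illustration.

```python
def pagerank_sketch(links, alpha=0.85, iterations=100):
    """Approximate PageRank by repeatedly redistributing the rank."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # With probability 1 - alpha the surfer jumps to a random page
        new_rank = {v: (1.0 - alpha) / n for v in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                # With probability alpha the surfer follows one of the links
                share = alpha * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # A page without links spreads its rank evenly
                for v in nodes:
                    new_rank[v] += alpha * rank[u] / n
        rank = new_rank
    return rank

# Hypothetical pages: a and b link to c; c and d link to each other
links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": ["c"]}
ranks = pagerank_sketch(links)
top = max(ranks, key=ranks.get)

print(top, ranks)
```

The ranks always add up to 1, and the page c, which collects the most incoming links, ends up with the highest rank.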


HITS Hubs and Authorities 

The HITS (Hyperlink-Induced Topic Search) algorithm is an extended version 
of PageRank. PageRank considers all graph nodes as potential terminals, or 
“sinks.” Once you get into a sink, you likely get sunk. “Sink-style” networks 
include the Web, trust networks (social networks built on the “A-trusts-B” 
relationship), and organizational networks (“A-is-a-subordinate-of-B”). 

You may want to study a network from the opposite perspective: what is the
probability that a person randomly traversing the edges has started at the
node? You can either reverse the graph by calling G.reverse() and then calculate
the PageRanks, or execute the HITS algorithm and get both hubs and
authorities values. Authorities are a loose counterpart of the PageRank. Hubs
consider outgoing links instead of incoming links. They serve as entry points
into your network so that you (or the fictitious randomly traversing person)
could get to the authorities most efficiently.

The subjects with the highest hubs and authorities in our Wikipedia network 
are Social Network (0.037699) and Graph (Discrete Mathematics) (0.005213), 
respectively. 


Comparing the Centralities 

As Ulrik Brandes from University of Konstanz mentioned in his keynote address
at the International Conference on Computational Social Science in July 2017,
“There are several hundred centrality indexes.” You just learned about seven
or eight centrality measures (depending on whether to count the HITS as one
or two), which may be 1 percent of all possible ways to establish a numeric
order in a network. Do you have to learn about the remaining 99 percent?

No, you don’t. First, those eight centralities adequately rank the nodes in any
complex network. Second, even some of those eight centralities are strongly
correlated. Let’s find out which.

This next code calculates eight types of centralities for each node in the Wikipedia
page network. Each function (except for nx.hits()) returns a dictionary with nodes
as keys and centralities as values. nx.hits() returns a pair of dictionaries.

measuring.py

dgr = nx.degree_centrality(G)
clo = nx.closeness_centrality(G)
har = nx.harmonic_centrality(G)
eig = nx.eigenvector_centrality(G)
bet = nx.betweenness_centrality(G)
pgr = nx.pagerank(G)
hits = nx.hits(G)

centralities = pd.concat(
    [pd.Series(c) for c in (hits[1], eig, pgr, har, clo, hits[0], dgr, bet)],
    axis=1)

centralities.columns = ("Authorities", "Eigenvector", "PageRank",
                        "Harmonic Closeness", "Closeness", "Hubs",
                        "Degree", "Betweenness")
centralities["Harmonic Closeness"] /= centralities.shape[0]

Then comes Pandas (check Another Kind of Pandas, on page 73). Let’s convert
each dictionary into a pd.Series (a labeled vector); then concatenate all vectors
into a pd.DataFrame (a labeled matrix). Finally, relabel the columns to reflect
their true nature and normalize the harmonic closeness centrality, because
it is the only centrality on the list that is not automatically normalized. The
DataFrame centralities is ready for analysis.

centralities.corr() calculates all pairwise correlations between the centralities and
returns an 8x8 symmetric DataFrame. More than half of the values in the
DataFrame are redundant. Use np.tri() from NumPy to generate a lower-left
triangular unit matrix of the proper size and mask the duplicates by multiplying
the DataFrame and the mask matrix element-wise.

Locating the strongest correlations in the table may be hard. Let’s reorganize it
into a tall pd.Series, sort by the values, and display the “tail”: the last five rows.

measuring.py

# Calculate the correlations for each pair of centralities
c_df = centralities.corr()
ll_triangle = np.tri(c_df.shape[0], k=-1)
c_df *= ll_triangle
cseries = c_df.stack().sort_values()
cseries.tail()

< Harmonic Closeness  Eigenvector    0.826834
Closeness           Authorities    0.828464
Betweenness         Degree         0.849580
PageRank            Eigenvector    0.882332
Eigenvector         Authorities    0.939547
dtype: float64
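
If you would like to experiment with the masking trick before running it on the real centralities, here is a sketch on a made-up three-column DataFrame. It is a variation on the fragment above: it uses where() instead of multiplication, so that the masked cells become missing values (and can be dropped) rather than zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; the column names are placeholders
df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [2, 4, 6, 9],
                   "C": [4, 3, 2, 1]})

corr = df.corr()
# Ones strictly below the main diagonal, zeros elsewhere
mask = np.tri(corr.shape[0], k=-1).astype(bool)

# Keep one copy of each pair and drop everything else
pairs = corr.where(mask).stack().dropna().sort_values()

print(pairs)
```

Each pair of columns now appears exactly once, with the strongest positive correlation in the last row.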


The complete analysis of all correlations (which you can do on your own,
especially after reading Outline Modularity-Based Communities, on page 136)
reveals that the centrality measures form two groups. The first group consists
of eigenvector and harmonic closeness centralities, PageRank, and authorities.
The second group has two subgroups: degree and betweenness centralities
in one, and closeness and hubs in the other. I am almost saying that knowing
one representative measure from each group (say, closeness, betweenness,
and eigenvector centralities) probably will suffice for all practical purposes.
But the final choice is yours.

To add another dimension to our story of centralities, let’s plot one of them
against another. Pandas DataFrames are elegantly integrated with Matplotlib. It
takes just a couple of function calls to plot two columns.

measuring.py

X = "Harmonic Closeness"
Y = "Eigenvector"
limits = pd.concat([centralities[[X, Y]].min(),
                    centralities[[X, Y]].max()], axis=1).values
centralities.plot(kind="scatter", x=X, y=Y, xlim=limits[0], ylim=limits[1],
                  s=75, logy=True, alpha=0.6)


The figure on page 98 shows the scatter plot of harmonic closeness and
eigenvector centrality for our network of Wikipedia pages. Note that the vertical
axis has a logarithmic scale to accommodate small eigenvector centralities.
Judging by the plot, the correlation of 0.826834 is entirely justifiable.


Estimate Network Uniformity Through Assortativity 


This section uses NumPy. 


In the last section of the chapter, let’s look at node attributes
we have completely ignored so far. As an example, I’ll use a
snapshot of Odnoklassniki, the second-largest Russian language social
networking site, harvested in February 2009. The snapshot has 408,715 nodes
and 4,482,086 edges. Each node has attributes age and gender (self-reported).


Attribute analysis looks into assortativity: correlation between the values of
a node attribute across edges. A network with positively correlated attributes
is called assortative; in an assortative network, nodes tend to connect to nodes
with similar attribute values. This tendency is called assortative mixing. A
dissortative (negatively correlated) network is the opposite of an assortative one.

The simplest form of assortativity is degree (indegree, outdegree) assortativity:
the correlation between the degree of a node and the average degree of its
neighbors. Function nx.average_degree_connectivity(G) returns a dictionary with
unique node degrees as keys and matching average neighbors’ degrees as
values. The following code fragment calculates the dictionary and separates
the keys and values into two lists (my_degree and their_degree):

my_degree, their_degree = zip(*nx.average_degree_connectivity(G).items())

The figure on page 99 shows the scatter plot of the two lists for the Odnoklass- 
niki network. If the network were uniform, all points would align along the 
dashed line. In reality, we observe the uniformity only around the nodes with 
about seventy neighbors. Nonetheless, the network is in general assortative 
because the slope of the curve is positive. The only dissortative part of the 
network is the one that contains the nodes with fewer than ten edges. 
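
For contrast, here is what a strongly dissortative network looks like through the same function. The sketch uses a synthetic five-node star, in which the only high-degree node is connected exclusively to low-degree nodes, and vice versa.

```python
import networkx as nx

G = nx.star_graph(4)  # hub 0 (degree 4) and four leaves (degree 1)

adc = nx.average_degree_connectivity(G)
# Sort by degree to make the order of the two tuples predictable
my_degree, their_degree = zip(*sorted(adc.items()))

print(my_degree)
print(their_degree)
```

The nodes of degree 1 have neighbors of degree 4 on average, and the node of degree 4 has neighbors of degree 1: the "curve" slopes downward.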

Degree assortativity may be somewhat hard to interpret, but when it comes
to other attributes, especially related to human demographics, there are certain
expectations, and a social theory that explains them: homophily. Homophily
is the propensity of actors to associate with somewhat similar actors. Let’s
check if our fragment of the social network is assortative with respect to age
and gender.

NetworkX provides two functions for assessing attribute assortativity. The first
function nx.attribute_mixing_matrix() takes a graph, an attribute name, and an
optional mapping dictionary, and returns a two-dimensional NumPy array.

nx.attribute_mixing_matrix(G, "gender", mapping={"M": 0, "F": 1})

< array([[ 0.22771058, 0.24064205],
[ 0.24064205, 0.29100532]])

The element in the ith row and jth column of the array is the fraction of edges
that connect a node with the ith value of the attribute to a node with the jth
value. The mapping links non-numeric attribute values with row and column
indexes. In the previous matrix, 0.22771058 (≈23%) of edges connect male actors, and
0.29100532 (≈29%) of edges connect female actors. The fraction of same-
gender edges is just above 50 percent, which suggests that the members of
the network do not prefer same-gender connections. The network is not
homophilic from the gender point of view.
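
The structure of the mixing matrix is easier to see on a tiny example. The sketch below builds a hypothetical four-person network (the names and genders are invented) in which exactly half of the edges connect actors of the same gender.

```python
import networkx as nx

# A hypothetical four-actor cycle: al - bo - di - cy - al
G = nx.Graph()
G.add_nodes_from(["al", "bo"], gender="M")
G.add_nodes_from(["cy", "di"], gender="F")
G.add_edges_from([("al", "bo"), ("al", "cy"), ("bo", "di"), ("cy", "di")])

mix = nx.attribute_mixing_matrix(G, "gender", mapping={"M": 0, "F": 1})
same_gender = mix.trace()  # the diagonal holds the same-gender fractions

print(mix)
print(same_gender)
```

In an undirected network, each mixed edge contributes equally to both off-diagonal cells, which is why the matrix is symmetric.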

The second function, nx.attribute_assortativity_coefficient(), confirms the previous
result. The function returns the assortativity coefficient: the correlation
between the values of an attribute across edges.

nx.attribute_assortativity_coefficient(G, "gender")

< 0.03356000539110733

The self-reported genders of the Odnoklassniki members are not correlated.
However, their ages are in better agreement:

nx.attribute_assortativity_coefficient(G, "age")

< 0.14409535867553133

The last result is not surprising at all if we recall that odnoklassniki in Russian
means classmates and it is natural for classmates to be of the same age.
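
The coefficient and the mixing matrix are two views of the same data. As a sketch, the code below recomputes the coefficient by hand from the mixing matrix of a hypothetical four-actor network, using the standard formula (trace minus the expected same-value fraction, divided by one minus the expected fraction), and compares the result with the library function.

```python
import networkx as nx

# Hypothetical network: two same-gender ties and one mixed tie
G = nx.Graph()
G.add_nodes_from(["al", "bo"], gender="M")
G.add_nodes_from(["cy", "di"], gender="F")
G.add_edges_from([("al", "bo"), ("cy", "di"), ("al", "cy")])

mix = nx.attribute_mixing_matrix(G, "gender", mapping={"M": 0, "F": 1})
observed = mix.trace()       # the fraction of same-gender edge ends
expected = (mix @ mix).sum() # the same fraction under random mixing
r_manual = (observed - expected) / (1 - expected)

r = nx.attribute_assortativity_coefficient(G, "gender")
print(r_manual, r)
```

Both numbers equal 1/3: this little network is mildly assortative with respect to gender.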

You learned how to measure a complex network by calculating its microscopic,
mesoscopic, and macroscopic properties, such as size, density, clustering
coefficient, centralities, and assortativities. The measured properties identify
the most important or unusual nodes and network neighborhoods. With a
couple dozen network measuring algorithms in your toolbox, you are ready
for a complete network analysis experiment. In the next chapter, you will go
through the first complete CNA case study.

This good luck of those of Chagre caused Captain Morgan to stay longer
at Panama, ordering several new excursions into the country round
about; and while the pirates at Panama were upon these expeditions,
those at Chagre were busy in piracies on the North Sea.

Alexandre Olivier Exquemelin, French, Dutch or Flemish writer

Chapter 9

Case Study: Panama Papers

The Panama Papers¹ represent a massive leak of offshore
corporate entity information (several hundred thousand
entities) from the Panamanian law firm Mossack Fonseca.
The papers unveil a never-before-seen network of
money-laundering connections.

In this chapter, you will learn how to convert a huge CSV file describing
connections between entities and officials into a social network. You will do
it two ways: with and without Pandas. You will also learn how to make simple
conclusions about the resulting network.

Create a Network of Entities and Officers 

The “Panama” network is a social network that describes relationships
between organizations and individuals traced through electronic
documentation. The network is available in five CSV files that are summarized in the table
on page 102.

Let’s first partially build this vast network and analyze some of its aspects
without using Python’s “heavy artillery,” Pandas, and later attempt a similar
analysis with Pandas. For the construction, let’s select only the edges that refer
to the “beneficiary-of” relationship (there are 19,194 edges labeled Beneficiary
of and beneficiary of). We will go through the files Entities.csv, Officers.csv, and
Intermediaries.csv in search of incident nodes. For each node, we will store its
name, type, and a three-letter country code. The first code block imports all
the necessary modules and defines the constants.


This chapter uses Matplotlib, NumPy, Pandas.


1. www.occrp.org/en/panamapapers/database

Name                Type   Purpose                               # of rows  Columns of interest
all_edges.csv       Edges  Each edge has a type of the           1,269,796  node_1, rel_type, node_2
                           represented relationship.
Addresses.csv       Nodes  Legal addresses of officers and       151,127    n/a
                           entities
Entities.csv        Nodes  Legal entities (corporations, firms,  319,421    name, jurisdiction
                           and so on)
Intermediaries.csv  Nodes  Persons and organizations that act    23,642     name, country_code
                           as links between other organizations
Officers.csv        Nodes  Persons (directors, shareholders,     345,645    name, country_code
                           and so on)


panama.py

import csv
import pickle
import itertools
from collections import Counter
import networkx as nx
from networkx.drawing.nx_agraph import graphviz_layout
import matplotlib.pyplot as plt
import dzcnapy_plotlib as dzcnapy

EDGES = "beneficiary"
NODES = (("Entities.csv", "jurisdiction", "name"),
         ("Officers.csv", "country_codes", "name"),
         ("Intermediaries.csv", "country_codes", "name"))



Where to Import?

According to PEP 8, Style Guide for Python Code, “Imports are always put at the
top of the file, just after any module comments and docstrings, and before module
globals and constants.”² Some developers argue that importing a module
immediately before its first use saves a nanosecond or so. It might, but it sure makes
tracking code dependencies on other libraries a nightmare.

2. www.python.org/dev/peps/pep-0008/#imports

Now, it’s time to build a network graph. Start with an empty nx.Graph object and
read all rows from all_edges.csv with a CSV dictionary reader. (But keep only
those future edges that are marked as beneficiary of in any character case.)

panama.py

panama = nx.Graph()

with open("all_edges.csv") as infile:
    data = csv.DictReader(infile)
    panama.add_edges_from((link["node_1"], link["node_2"])
                          for link in data
                          if link["rel_type"].lower().startswith(EDGES))

Remember that when you add an edge to a network, NetworkX also adds both
incident nodes. However, at the moment, the nodes have only more or less
randomly chosen labels, and no attributes. Let’s import the attributes and
true names from the other three files.
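
If you don't have the data files at hand, you can rehearse the edge-filtering step on an in-memory sample. The sketch below feeds three invented rows (with the same column names as all_edges.csv) to the CSV dictionary reader; only the rows whose relationship starts with "beneficiary" in any character case become edges.

```python
import csv
import io

import networkx as nx

# Three made-up rows that mimic the layout of all_edges.csv
sample = io.StringIO(
    "node_1,rel_type,node_2\n"
    "1,Beneficiary of,2\n"
    "3,shareholder of,4\n"
    "5,beneficiary of,2\n"
)

toy = nx.Graph()
toy.add_edges_from((row["node_1"], row["node_2"])
                   for row in csv.DictReader(sample)
                   if row["rel_type"].lower().startswith("beneficiary"))

print(sorted(toy.nodes()), toy.number_of_edges())
```

Note that the node labels are strings, not numbers: the CSV reader does not convert data types.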

The purpose of the set nodes is to facilitate future lookup. (Python lists
have linear lookup time.) Read each of the files with a CSV dictionary reader
and extract and collect the desired attributes. Note that there is no need to
process rows that do not match any existing node (because your network
does not include all nodes and edges) or to add any nodes to the graph (because
they have been already added by way of the incident edges). When done,
update the node attributes country and kind, and relabel the nodes to match
persons’ and organizations’ names.

panama.py

nodes = set(panama.nodes())
relabel = {}

for f, cc, name in NODES:
    with open(f) as infile:
        kind = f.split(".")[0]
        data = csv.DictReader(infile)
        names_countries = {node["node_id"]:
                           (node[name].strip().upper(), node[cc])
                           for node in data
                           if node["node_id"] in nodes}
    names = {nid: values[0] for nid, values in names_countries.items()}
    countries = {nid: values[1] for nid, values in names_countries.items()}
    kinds = {nid: kind for nid, _ in names_countries.items()}
    nx.set_node_attributes(panama, "country", countries)
    nx.set_node_attributes(panama, "kind", kinds)
    relabel.update(names)

nx.relabel_nodes(panama, relabel, copy=False)

if "ISSUES OF:" in panama:
    panama.remove_node("ISSUES OF:")
if "" in panama:
    panama.remove_node("")

print(nx.number_of_nodes(panama), nx.number_of_edges(panama))

Finally, remove the phony node ISSUES OF: (there is no explanation of its
purpose in the dataset documentation) and a mysterious node with no name
at all. The complete network panama has 27,930 (3.3%) nodes and 19,137
(1.5%) edges. The density of the network is 0.000049.

As it turns out, the newly minted network consists of several thousand
disconnected fragments called components, most of which have only two to three
nodes. You will learn how to work with components in Split Networks into
Connected Components, on page 126. However, looking ahead, you can use
function nx.connected_component_subgraphs() to elicit each component’s graph and
keep it only if it has at least twenty edges or twenty nodes. The choice of
numbers is somewhat arbitrary; you may want to play with them to wipe off
as much of the “network dust” as you wish. In fact, you are under no obligation
to do any filtering at all, or you can select only the biggest component.

panama.py 

components = [p.nodes() for p in nx.connected_component_subgraphs(panama)
              if nx.number_of_nodes(p) >= 20
              or nx.number_of_edges(p) >= 20]
panama0 = panama.subgraph(itertools.chain.from_iterable(components))

print(nx.number_of_nodes(panama0), nx.number_of_edges(panama0))

with open("panama-beneficiary.pickle", "wb") as outfile:
    pickle.dump(panama0, outfile)

The refined network panama0 has 1,393 nodes and 1,926 edges. Pickle it for
future use!
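
If you run a recent NetworkX release, note that nx.connected_component_subgraphs()
was removed in version 2.4. The following is a minimal sketch of the same
filtering idea built on nx.connected_components(); the toy graph and the
MIN_SIZE threshold are made up for illustration and are not part of the case
study.

```python
import networkx as nx

# Toy graph (hypothetical data): a 4-node path, a 2-node pair, and a lone node
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")])
G.add_node("z")

MIN_SIZE = 3  # arbitrary threshold, playing the role of "twenty" in the text

# Each component is a set of nodes; G.subgraph() gives its induced subgraph
big_nodes = set()
for component in nx.connected_components(G):
    sub = G.subgraph(component)
    if sub.number_of_nodes() >= MIN_SIZE or sub.number_of_edges() >= MIN_SIZE:
        big_nodes.update(component)

# copy() makes the result an independent graph rather than a view
G0 = G.subgraph(big_nodes).copy()
print(G0.number_of_nodes(), G0.number_of_edges())
```

Only the four-node path survives here; the pair and the lone node are the
“network dust.”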

Draw the Network 

The size of the network generated in the previous section makes its visualization
almost useless. However, the plotting fragment is still included in the case study.
Nodes are painted by their kind: Entities are lightly colored, and Officers are dark.

panama.py 

cdict = {"Entities": "pink", "Officers": "blue",
         "Intermediaries": "green"}
# One color per node, chosen by the node's "kind" attribute
c = [cdict[panama0.node[n]["kind"]] for n in panama0]
dzcnapy.small_attrs["node_color"] = c
pos = graphviz_layout(panama0)
nx.draw_networkx(panama0, pos=pos, with_labels=False, **dzcnapy.small_attrs)
dzcnapy.set_extent(pos, plt)
dzcnapy.plot("panama0")
The following figure shows the sketch of the network (naturally, without the
labels). You can see with a naked eye that some components have an Officer
in the center, surrounded by Entities; in some other components, conversely,
Officers surround an Entity in the center.



We can learn more about the network by applying numerical analytical
methods to it.

Analyze the Network 

Out of so many ways to analyze a network, let’s focus on degree and
assortativity analysis. All network nodes have two attributes: kind (with three
possible values Entities, Officers, or Intermediaries) and country. Let’s have a
look at each of the assortativities, both directly and through the attribute
mixing matrix (only for the kind).

panama.py 

nx.attribute_assortativity_coefficient(panama0, "kind")
nx.attribute_mixing_matrix(panama0, "kind",
                           mapping={"Entities": 0, "Officers": 1,
                                    "Intermediaries": 2})
nx.attribute_assortativity_coefficient(panama0, "country")
nx.degree_assortativity_coefficient(panama0)
< -0.9896603076687625
  array([[ 0.00000000e+00,  4.97403946e-01,  2.59605400e-04],
         [ 4.97403946e-01,  4.67289720e-03,  0.00000000e+00],
         [ 2.59605400e-04,  0.00000000e+00,  0.00000000e+00]])
  0.07539400284377736
  -0.39717073403670283

The network is almost perfectly dissortative with respect to the node kinds.
The matrix explains the details: an Entity node (column 0) is almost always
connected to an Officer node (column 1) because Officers are “beneficiaries-of”
Entities. Very few Entities are linked to each other through Intermediaries.
The most obscure edges connect Officers to Officers. Without fully
understanding the semantics of the relationship “beneficiary-of,” you cannot
judge whether these edges are legitimate or not. Nonetheless, the matrix gives
you a generalized snapshot of how a typical network fragment would look.
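
To get a feel for these numbers, it may help to try the two functions on a
graph small enough to check by hand. The sketch below assumes NetworkX 2.x or
later (for the G.nodes[n] syntax); the node names are invented. Every edge
joins an Officer to an Entity, so the graph should come out perfectly
dissortative.

```python
import networkx as nx

# Toy graph (hypothetical): two Officers, each a beneficiary of two Entities
G = nx.Graph()
G.add_edges_from([("officer1", "entity1"), ("officer1", "entity2"),
                  ("officer2", "entity3"), ("officer2", "entity4")])
for n in G:
    G.nodes[n]["kind"] = "Officers" if n.startswith("officer") else "Entities"

r = nx.attribute_assortativity_coefficient(G, "kind")
M = nx.attribute_mixing_matrix(G, "kind",
                               mapping={"Entities": 0, "Officers": 1})
print(r)   # assortativity with respect to "kind"
print(M)   # fraction of edge ends joining each pair of kinds
```

In this toy graph the coefficient is -1 and the mixing matrix has zeros on its
diagonal, which is the limiting case of the pattern seen in panama0.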

The “Panama” network is rooted in corporate offshoring. You would expect
that the country codes associated with network nodes are quite diverse
because that is what offshoring is all about. And indeed they are: a correlation
close to 0 (0.075) is no correlation. The network of country codes is neither
assortative nor dissortative. It has random connectivity, totally appropriate
for offshoring-based money laundering.

The last number in the output is the degree assortativity coefficient. It is 
negative, suggesting that the nodes with a higher degree are surrounded, on 
average, by nodes with a smaller degree, and the other way around. 

An essential output of complex network analysis is a node degree distribution.
According to Barabási and Albert [BA99], if a network is a result of preferential
attachment, then the degrees d in the network are distributed by the power
law: $p(d)=d^{-\alpha}$. The converse, in general, is not true. (See a brief
overview on page 5 and more in What Makes Components Giant?, on page 129.) If you
apply log() to the equation, you will get a linear dependency:
$\log(p(d))=-\alpha\log(d)$.

Let’s check if the network has at least a chance of being a Barabasi-Albert 
network. The next code fragment calculates the degrees of all nodes, and 
counts and plots the frequency of each degree. 

panama.py 

deg = nx.degree(panama0)
x, y = zip(*Counter(deg.values()).items())

The degree distribution in the log-log scale is in the figure on page 107. The 
dots align along a noisy but clearly visible straight line, signaling you that 
the “Panama” network, like many other social networks, may have been put 
in order by the forces of preferential attachment. 
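
As a rough numerical companion to the plot, you can fit a straight line to the
log-log points and read the exponent off its slope. The sketch below is an
illustration under assumptions: it uses a synthetic Barabási-Albert graph as a
stand-in for panama0, and an ordinary least-squares fit, which is a quick
sanity check rather than a rigorous estimator of the exponent.

```python
from collections import Counter

import networkx as nx
import numpy as np

# Synthetic preferential-attachment network (a stand-in for panama0)
G = nx.barabasi_albert_graph(2000, 2, seed=42)

deg = dict(nx.degree(G))    # dict() works with both old and new NetworkX
x, y = zip(*Counter(deg.values()).items())

# Fit log(p(d)) = -alpha * log(d) + const by least squares
log_x = np.log(x)
log_y = np.log(np.array(y) / len(G))
slope, intercept = np.polyfit(log_x, log_y, 1)
alpha = -slope
print("estimated alpha: {:.2f}".format(alpha))
```

A clearly negative slope is consistent with a power law, but it does not prove
one: as noted above, the converse is not true in general.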
One peculiarity of a power law distribution is its “long tail” that, at least
theoretically, tolerates nodes with arbitrarily high degrees, limited only by
the total graph size. Your degree distribution has a long tail, too. So, who are
they, the nodes at the tail? Let’s run the final lines of the analysis script that
reports the top ten nodes with the highest degree, nicely formatted.

panama.py 

top10 = sorted([(n, panama0.node[n]["kind"], v) for n, v in deg.items()],
               key=lambda x: x[2], reverse=True)[:10]
print("\n".join(["{} ({}): {}".format(*t) for t in top10]))

< HELITING S.A. (Officers): 80
  T.K.B.K. INTERNATIONAL TRUST (Entities): 39
  WORLDWIDE COM-NET INTERNATIONAL TRUST (Entities): 37
  THE CLAUDIUS TRUST (Entities): 36
  GUANGZHOU CONSTRUCTION & DEVELOPMENT HOLDINGS (CHINA)LIMITED (Officers): 29
  RICARDO CAMPOLLO CODINA (Officers): 27
  ISLANDS INTERNATIONAL TRUST (Entities): 27
  ZEN TRUST (Entities): 26
  FRIENDS OF ASSISI TRUST (Entities): 26
  MR. OLEKSII MYKOLAYOVYCH AZAROV (Officers): 26

Four of the tail nodes are Officers; the other six are Entities. No
Intermediaries node made it to the top list. I don’t know who these individuals
and organizations are. Chances are, you don’t know, either. In that case, just
include the results in your final report and deliver it to your data sponsor,
the person who ordered the analysis of this complex network.
Build a “Panama” Network with Pandas

Pandas is a library that is known to make the hard tasks of reading tabular
files and manipulating tabular data easy. Pandas takes care of parsing an edge
list from a CSV file into a DataFrame; nx.from_pandas_dataframe() converts a
DataFrame into a graph, making network construction all but trivial.

Now let’s take care of the nodes. Read the three node attribute files into their
DataFrames, mark them properly (so that you would know later which nodes come
from each file), and merge the parts into one DataFrame named all_nodes.

panama-ca.py 

import networkx as nx
import pandas as pd
import numpy as np

# Read the edge list and convert it to a network
edges = pd.read_csv("all_edges.csv")
edges = edges[edges["rel_type"] != "registered address"]
F = nx.from_pandas_dataframe(edges, "node_1", "node_2")

# Read node lists
officers = pd.read_csv("Officers.csv", index_col="node_id")
intermediaries = pd.read_csv("Intermediaries.csv", index_col="node_id")
entities = pd.read_csv("Entities.csv", index_col="node_id")

# Combine the node lists into one dataframe
officers["type"] = "officer"
intermediaries["type"] = "intermediary"
entities["type"] = "entity"

all_nodes = pd.concat([officers, intermediaries, entities])

Just like any other real-life data, the “Panama papers” dataset is “dirty”: it
has duplicates, omissions, typos, and so on. With the following code fragment,
you can partially unify the duplicated names by removing all leading and
trailing whitespaces, merging all inner whitespaces, converting all characters
to uppercase, converting LIMITED to LTD, and removing some honorifics.

panama-ca.py 

# Do some cleanup of names
all_nodes["name"] = all_nodes["name"].str.upper().str.strip()

# Ensure that all "Bearers" do not become a single node
all_nodes["name"].replace(
    to_replace=[r"MRS?\.\s+", r"\.", r"\s+", "LIMITED", "THE BEARER",
                "BEARER", "BEARER I", "EL PORTADOR", "AL PORTADOR"],
    value=["", "", " ", "LTD", np.nan, np.nan, np.nan, np.nan, np.nan],
    inplace=True, regex=True)
A lot of “Panama” officials go under the nicknames “THE BEARER” and “EL
PORTADOR.” If left unchanged, they may be later lumped into a single network
node. Rename them to a NumPy value np.nan to keep them anonymous but
distinct.

The network is structurally ready, but the nodes do not have attributes; the
attributes are stored in a separate DataFrame. Attaching them to the nodes now
is impractical. It is highly unlikely that you will analyze the whole network
at once, so it makes no sense to invest into decorating all the nodes. Let’s
first identify the area of interest, extract it from the network, and then assign
the attributes and new labels to the surviving nodes. This order of network
construction is contrary to anything you’ve seen so far, but for a big network,
it saves you a lot of CPU time and computer memory.

As an exercise, extract a network of officers, entities, and intermediaries
related to an economically, politically, and geographically compact region:
Central Asia (Kazakhstan, Kyrgyzstan, Uzbekistan, Turkmenistan, and
Tajikistan). Offshoring, especially fraudulent offshoring, is not very widespread
in this area, and we can hope to obtain a reasonably small subnetwork.

Start with a list of seed nodes, seeds: all nodes that are known to belong to
the Central Asian region because their country codes are in the set of wanted
country codes CCODES.

In reality, some nodes have more than one country code, in which case the
codes are separated by a semicolon. Can you modify the code to select the
nodes that are associated with Central Asia through at least one of their
country codes?
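
One possible answer, offered as a sketch rather than as the book’s solution:
split each country_codes value on the semicolon and test whether any piece
belongs to the wanted set. The toy all_nodes table and the helper name
touches_region() are hypothetical.

```python
import pandas as pd

CCODES = {"UZB", "TKM", "KAZ", "KGZ", "TJK"}

# Toy stand-in for all_nodes (hypothetical rows)
all_nodes = pd.DataFrame(
    {"name": ["A", "B", "C", "D"],
     "country_codes": ["KAZ", "RUS;KGZ", "USA;GBR", None]},
    index=[101, 102, 103, 104])

def touches_region(codes):
    """True if at least one semicolon-separated code is in CCODES."""
    if not isinstance(codes, str):      # missing value
        return False
    return not CCODES.isdisjoint(codes.split(";"))

seeds = all_nodes[all_nodes["country_codes"].apply(touches_region)].index
print(list(seeds))
```

Nodes 101 and 102 become seeds; node 102 would have been missed by a plain
isin() test because its value is the compound string "RUS;KGZ".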

What you will do next is essentially construct a joint ego network of the seed
nodes expanding two hops away from the seeds. The function
nx.single_source_shortest_path_length(F, seed, cutoff=None) computes the lengths
of the shortest paths from the node seed to all reachable nodes that are cutoff
hops away or closer. The function returns a dictionary with the target nodes as
keys, so the keys are the cutoff-neighborhood of the seed. Extract a subgraph
that contains all the keys from all the dictionaries, with all the connecting
edges. It is the network that you want.
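
Before applying the idea to the full dataset, you may want to see it on a graph
that you can check by hand. The following sketch uses a made-up eight-node path
graph and two made-up seeds; wrapping the result in dict() is a precaution so
that the code does not depend on the exact return type of a particular
NetworkX release.

```python
import networkx as nx

# Toy path graph 0-1-2-3-4-5-6-7 (hypothetical data)
F = nx.path_graph(8)
seeds = [0, 7]

reached = set()
for seed in seeds:
    lengths = dict(nx.single_source_shortest_path_length(F, seed, cutoff=2))
    # Keys are the nodes within two hops of the seed (the seed included)
    reached.update(lengths)

ego = F.subgraph(reached)
print(sorted(ego.nodes()), ego.number_of_edges())
```

Nodes 3 and 4 are three or more hops away from both seeds, so they are left
out, and the joint ego network falls apart into two fragments.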

panama-ca.py 

CCODES = "UZB", "TKM", "KAZ", "KGZ", "TJK"
seeds = all_nodes[all_nodes["country_codes"].isin(CCODES)].index
nodes_of_interest = set.union(*[
    set(nx.single_source_shortest_path_length(F, seed, cutoff=2).keys())
    for seed in seeds])

# Extract the subgraph and relabel it
ego = nx.subgraph(F, nodes_of_interest)
nodes = all_nodes.ix[ego]
nodes = nodes[~nodes.index.duplicated()]    # the highlighted line
nx.set_node_attributes(ego, "cc", nodes["country_codes"])
valid_names = nodes[nodes["name"].notnull()]["name"].to_dict()
nx.relabel_nodes(ego, valid_names, copy=False)

# Save and proceed to Gephi
with open("panama-ca.graphml", "wb") as ofile:
    nx.write_graphml(ego, ofile)

The subnetwork has 3,848 nodes and 8,643 edges, which is quite modest by CNA
standards. Because of the imperfection of the dataset, some nodes may have
identical identifiers; remove the duplicates (see the highlighted line). Finally,
set the cc (country code) attribute and relabel the nodes that can be relabeled.

You can save the resulting network into a .graphml file for future use. For
instance, you can sketch the network in Gephi, as shown in the following figure.
(NetworkX itself is not powerful enough to produce images of big networks.)
The dark (blue) nodes represent the officers, entities, and intermediaries
related to the Central Asian countries. The gray nodes are in their
2-neighborhood. The node size represents degree. You can see that the dark nodes
congregate on the periphery of the network, despite formally being its seeds!
The major offshoring companies (ShareCorp Ltd., Portcullis Trustnet (BVI)
Ltd., and Execorp Ltd., the three largest circles in the chart) do not seem to
be very interested in Central Asia.

You can apply the method presented in the case study to any network of 
organizations (entities) and members (officials). Beware that if the network is 
large, it is better to avoid NetworkX for its visualization and use Gephi. 

In the Next Part 

It is exciting to analyze networks with explicitly connected nodes. It is even 
more exciting to analyze networks with no immediate connections. The next 
part of the book looks into co-occurrence networks.
Part III

Networks Based on Co-Occurrences 


An interesting, relatively understudied (compared
to social networks), and crucial class of complex
networks is networks based on item co-occurrence:
items being in the same place (or close enough)
at the same time. In this part, you will learn how
to construct “unorthodox” networks and analyze
their structure.




Life is, of course, a series of coincidences, but we never cease to
be surprised as each new one happens, and nothing can destroy
their recurring freshness.

Robert Lynd, Irish writer 


CHAPTER 10

Constructing Semantic and 
Product Networks 


An interesting, novel, and relatively understudied class of complex networks
is networks based on co-occurrence, or coincidence: the property of items
being in the same place (or close enough) at the same time. The edges in
co-occurrence networks are implicit: they are not given (and often not even
obvious); you have to deduce, extract, and calculate them from other data,
and this is a significant departure from the relatively intuitive way you build
social networks. Co-occurrence networks are living proof that you can connect
anything to anything and make sense of the connections.

In this chapter, you will learn how to start with a seemingly odd collection of
material or immaterial items, examine temporal and spatial connections
between them, identify significant relationships, and convert your observations
into a network graph. Just like social network graphs, these graphs have
nodes and edges with respective attributes, but this time you will go one step
further and explore their complex internal structure. You will be able to divide
and conquer a complex network: decompose it into components, cores, coronas,
communities, and similar structural elements; assign proper names to
the extracted parts; understand their purpose and importance; and put them
together again.

We will start by looking at two examples of co-occurrence networks: semantic
networks and product networks. In the next chapters, we’ll go over definition,
extraction, naming, and use of complex network constituents.
Semantic Networks

A semantic network is a network of nodes that represent terms (words, word
stems, word groups, or concepts) connected based on the similarity or
dissimilarity of their usage or meanings. Link terms that:

• Are commonly used together in the same place in text: same sentence, 
paragraph, chapter, scene, act, list of keywords, list of interests in a social 
network, and so on (“semantic” “network”) 

• Describe the same property (“red” “blue”) 

• Occupy the same semantic niche (synonyms: “program” “application”;
hypernyms: “pet” “cat”; antonyms: “erase” “restore”)

In the latter case, you may want to assign negative weights to some edges,
which would make many network processing algorithms heartbroken. If your
network has negatively weighted edges by construction, be prepared to remove
them before analyzing the network.
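
Removing such edges takes only a couple of lines. Here is a minimal sketch; the
word pairs come from the list above, but the weights are invented for the sake
of the example.

```python
import networkx as nx

# Toy semantic network (hypothetical weights): antonyms get a negative weight
G = nx.Graph()
G.add_weighted_edges_from([("program", "application", 1.0),
                           ("pet", "cat", 0.5),
                           ("erase", "restore", -1.0)])

# Collect the offending edges first, then remove them in one call
negative = [(u, v) for u, v, w in G.edges(data="weight") if w < 0]
G.remove_edges_from(negative)
print(sorted(G.edges()))
```

Note that removing an edge does not remove its end nodes: “erase” and “restore”
stay in the network as disconnected nodes, and you have to decide separately
what to do with them.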

Knowledge specialists use semantic networks for graphical (and
machine-readable) knowledge representation, and social and behavioral
researchers and anthropologists use semantic networks for semantic domain
analysis. Let’s have a look at two not-so-typical semantic networks: a network
of keywords for fraud-related research papers and a network of characters from
Othello.

Detect Food Fraud 

Semantic networks often reveal surprising facts about texts and other term
collections (corpora). Suppose you do research in accounting (namely, in fraud)
and want to know everything about fraud types. You understand that nobody
knows fraud better than other fraud researchers and fraudsters themselves.
The latter are typically off limits, but the former are well represented in
numerous databases of academic research papers. You could collect all research
papers that mention “fraud,” extract subject tags assigned to them by database
editors, and create a semantic network of the tags, based on their co-occurrence.
The subject tags (such as DNA and meat industry) are the nodes of the network.
Two tag nodes are adjacent if the tags are frequently assigned together
to the same paper. For example, the nodes food fraud and food safety are
adjacent because many research papers focus on food fraud and food safety.
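
The construction just described can be sketched in a few lines. Everything in
the sketch is an assumption made for illustration: the four “papers,” their
tags, and the MIN_PAPERS cut-off that stands for “frequently.”

```python
from collections import Counter
from itertools import combinations

import networkx as nx

# Hypothetical subject tags of four papers (made-up data)
papers = [
    {"food fraud", "food safety", "DNA"},
    {"food fraud", "food safety", "meat industry"},
    {"food fraud", "DNA"},
    {"accounting fraud", "auditing"},
]

# Count how many papers carry each unordered pair of tags
pair_counts = Counter()
for tags in papers:
    pair_counts.update(combinations(sorted(tags), 2))

MIN_PAPERS = 2  # arbitrary cut-off for "frequently assigned together"
G = nx.Graph()
G.add_weighted_edges_from((a, b, n) for (a, b), n in pair_counts.items()
                          if n >= MIN_PAPERS)
print(sorted(G.edges(data="weight")))
```

Sorting the tags before pairing them guarantees that each pair is always counted
under the same key, whichever order the tags appear in.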

The original network (adapted from Conceptual Structure of Fraud Research
and Its Dynamics [GZ17]), shown in the figure on page 117, is huge and could
tell us many an exciting story. However, we will look only at the circled
fragment in the bottom left corner.



The most striking conclusion from the figure is that the selected fragment is
almost entirely removed from the rest of the network. It is connected to the
bulk of the network with only one edge.

The figure on page 118 depicts the close-up view of the fragment. Remember
that as a rule, node size represents node importance (in our case, the number
of tags), and edge width represents edge weight. The nodes are colored based
on their membership in network communities (Outline Modularity-Based
Communities, on page 136), but this property is not relevant now.

Just by glancing through the node labels, you can see that the topic of the
fragment is food fraud, also known as adulteration (no connection to adultery,
adult stores, or any other “adult” business). Apparently, there is “fraud,” and
there is “food fraud!” Within the “food fraud” fragment, you can see tags
related to fraud objects (“milk,” “olive oil,” “meat”); fraud detection methods
(“spectroscopy,” “DNA,” “principal component analysis”); fraud prevention
mechanisms (“food labeling,” “identification”), and so on. If you are a PhD
student or young postdoc looking for a future fraud-related research direction,
you may be excited to have come across this semantic network fragment.
Judging by its secluded position, few “hardcore” fraud analysts know or care
about food! Why not become one of those who do?
[Figure: close-up view of the food fraud fragment. Legible node labels include
restriction fragment length polymorphism; mass spectrometry; Fourier transform
infrared spectroscopy; diffuse reflectance infrared Fourier transform
spectroscopy (DRIFT); and attenuated total reflectance Fourier transform
infrared spectroscopy (ATR, FTIR).]


By the way, we will show you how to construct a similar network step-by-step
in Chapter 12, Case Study: Performing Cultural Domain Analysis, on page 141.

Expose a Protagonist 

Who is the protagonist (or the main character, if you prefer less academic
speak) of Othello? Be careful with what you answer because a trivial question
like this must be tricky. Hint: No, Othello is not a protagonist. At least not
by the standards of semantic networks.

The emerging field of digital humanities uses co-occurrence semantic networks
to analyze texts: plays, scripts, and other forms of prose and poetry. The
method allows us to identify the main and peripheral characters (see
core-periphery analysis on page 129); group characters and places (see Outline
Modularity-Based Communities, on page 136); and eventually break down the
storyline into scenes suitable, say, for film or stage adaptation.

Let’s outline a semantic network construction from the text of Othello. After
you read the next chapter and the case studies, you will be able to implement
the algorithm in Python. This exercise is inspired by Measuring Tie Strength
in Implicit Social Network [EG12].

1. You need a list of all characters. Othello is a short text; you can compose
the list by hand. Alternatively, find all references to Enter and Exit remarks,
or collect references to all characters as they speak if there is a property
in the text that identifies the characters. For example, a character may
be marked with an HTML tag, as in <A NAME=speech1><b>RODERIGO</b></a>.¹



2. You need a definition of co-occurrence. Play scripts are perfect from this
point of view: two characters co-occur if they occur in the same scene! In
a general text, co-occurrence may be based on paragraphs, sections,
chapters, pages, and so on.

3. Now that you have characters (nodes) and their co-occurrences (edges),
you can build a network. Remarkably, once constructed, this network is
a social network, of which you heard so much in Chapter 6, Understanding
Social Networks, on page 53. The result is shown in the following figure.

[Figure: co-occurrence network of the characters of Othello. Legible node
labels include Emilia, Bianca, First Musician, and the First through Fourth
Gentlemen.]


4. Finally, you need a measure of importance. How do you know, indeed,
who is the protagonist of the story? Luckily, you have the whole box of
network centralities (Choose the Right Centralities, on page 92) that you
can apply to each node. When you work with a social network, and the
network in the figure is a social one, the best importance measures are
betweenness and eigenvector centralities. The eigenvector centrality is
proportional to the graph node sizes, and the betweenness centrality is
reflected by the node color (the darker, the more central). Both centralities
seem to be in good agreement: Iago is the protagonist. Not Othello.
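
The four steps above can be sketched as follows. Be warned that the scene casts
in the sketch are invented to keep it short; they are not the real scene
structure of Othello, so the outcome demonstrates the mechanics and proves
nothing about the play.

```python
from itertools import combinations

import networkx as nx

# Made-up scene casts, NOT the real scene structure of Othello
scenes = [
    ["IAGO", "RODERIGO", "BRABANTIO"],
    ["IAGO", "OTHELLO", "CASSIO"],
    ["OTHELLO", "DESDEMONA"],
    ["IAGO", "EMILIA"],
]

G = nx.Graph()
for cast in scenes:
    for a, b in combinations(cast, 2):
        # Edge weight counts the scenes shared by the two characters
        weight = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=weight)

betweenness = nx.betweenness_centrality(G)
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)
print(max(betweenness, key=betweenness.get))
print(max(eigenvector, key=eigenvector.get))
```

With real data, you would generate the scenes list from the Enter and Exit
remarks or from the speaker tags mentioned in step 1.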


1. shakespeare.mit.edu/othello/full.html 
So, is Iago indeed the protagonist of Othello? Some researchers strongly believe
that he is.² Welcome to digital humanities!

Product Networks 

A product network is a network of retail items. Network nodes in a product 
network represent items purchased by individuals and co-occurring in their
shopping baskets or carts. You can connect two product nodes if customers 
often or always buy the respective products together. We call such products 
complements. Left and right shoes (if sold separately), nuts and bolts, nails 
and hammers, and one-way airline tickets from Boston to Seattle and from 
Seattle to Boston are good examples of complements: when you buy one, you 
almost always buy the other as well. 

Product networks can (but do not have to) be weighted: you can define the
weight of the edge as the frequency of co-purchasing. You can slice (Slice
Weighted Networks, on page 79) the network later to remove low-weighted
edges, if you want.

Sometimes product networks allow negatively weighted edges. If one of the
products in a pair is a reasonable replacement for the other (in some sense!),
we call them substitutes. If you live in Alaska and buy a husky to pull your
sled, then you probably won’t buy a reindeer for the same purpose, at least
not at the same time. (You can still get a reindeer as a pet.) A husky and a
reindeer are substitutes; you can connect the respective nodes with a
negatively weighted edge to represent their substitutive nature.

Here are two product networks for you, as a warm-up: a network of common 
cooked food ingredients and a network of tools and materials for a painting 
do-it-yourself project. 

Explore Your Pantry 

To find a product network, look no further than your pantry.

When you buy prepared food (say, a can of baked beans), you buy an elaborate
concoction of ingredients: prepared beans, water, sugar, applewood smoked
bacon, molasses, textured vegetable protein, and many others. You can think
of the ingredients as separate products that happen to be packed together in
the can. They occur in the same place at the same time; therefore, they are
excellent candidates for becoming product network nodes. By constructing
a product network, you can learn which ingredient combinations are most
common, whether and how the ingredients group, and which ingredients are
central to our food.


2. http://www.shmoop.com/othello/antagonist.html

You can collect data for a network of ingredients from the website of the
United States Department of Agriculture (USDA).³ There is no need to download
all several hundred thousand product descriptions. For starters, we
suggest crawling a couple of thousand pages; for example, 925 products with
356 distinct ingredients.

In the following figure, two ingredient nodes are connected if they happen 
together in more than five food items (the threshold of five was chosen to keep 
the network connected but not too hairy). 
3. ndb.nal.usda.gov/ndb/search
For each node, we calculate its betweenness (color) and eigenvector (size)
centralities. The most central nodes represent the core ingredients. Not
surprisingly, the top three ingredients are salt, sugar, and water: they go into
almost any food item, and they almost always go side by side. The composition
of the second ring is less predictable. The next ten most central ingredients
are: citric acid (acidifier), maltodextrin (sweetener), xanthan gum (thickener),
enzymes (catalysts), natural flavors, wheat flour, niacin (vitamin B), riboflavin
(another kind of vitamin B), folic acid (yet another kind of vitamin B), and
lecithin (emulsifier). Most of these ingredients are responsible for foundational
taste and texture food properties, which explains their position in the network.


Design a Do-It-Yourself Store

Networks of products are common in marketing analysis. Marketing specialists
construct product networks to reveal tightly coupled groups of products
frequently purchased together. Retailers may compactly stock the products in
a group in stores for the ease of shopping. If someone buys a product from a
group, they may be reminded to buy other products from the same group.
Finally, a group of products may be a stepping stone in a long-term customer
project (for example, someone purchasing masonry products may be building
a garage and would later need carpentry tools and materials, followed by
brushes and paints).


The following figure shows a product network for a pre-painting project,
derived from the sales data collected and provided by a Fortune 500 specialty
retailer (adapted from Building Mini-Categories in Product Networks [ZLZ15]).
The network has only nine nodes, four of which are isolated from the other
five (we call them isolates; you will learn more about them in Locate Isolates,
on page 125). Apparently, the isolated products were insignificantly connected
to their neighbors, and the network analyst decided to drop the thin edges.
If you were to design an ideal do-it-yourself store, you would put water
applicators, tack cloths, sanding sheets, sanders, and mineral spirits on the
same shelf, and the other four products on the next shelf. If your customers
bought a sanding sheet, but no sander, you (or your recommendation system)
would remind them to purchase the sander and another seven items as well.
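
A recommendation of this kind can be prototyped as a lookup of the connected
group that contains the purchased item. The sketch below is hypothetical in
every detail: the five product names come from the text, but the edges between
them, the two isolates, and the function recommend() are invented. It suggests
only the products connected to the basket, which is a narrower policy than
reminding the customer about every item in the project.

```python
import networkx as nx

# Hypothetical product network: one connected group and two isolates
G = nx.Graph()
G.add_edges_from([("sanding sheets", "sander"),
                  ("sander", "tack cloths"),
                  ("tack cloths", "mineral spirits"),
                  ("mineral spirits", "water applicators")])
G.add_nodes_from(["drop cloths", "painter's tape"])

def recommend(G, basket):
    """Suggest products connected to the basket that are not in it yet."""
    suggestions = set()
    for item in basket:
        if item in G:
            suggestions.update(nx.node_connected_component(G, item))
    return suggestions - set(basket)

print(sorted(recommend(G, ["sanding sheets"])))
```

A customer who buys only an isolated product gets no suggestions from this
function, which is one more reason to look into why a product is an isolate.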

You learned about two uncommon types of complex networks: semantic
networks of words and concepts and co-purchasing-based product networks.
The latter can be found in marketing research; the former apply to text
analysis and knowledge representation. You will see a complete example of a
product network construction and analysis in Chapter 13, Case Study: Going
from Products to Projects, on page 153. However, before you construct something
big, you will learn how to deconstruct a network into compact blocks. In the
world of large complex networks, dividing and conquering is the only way to
manage complexity. The next chapter will show you how to unearth the network
structure.
Divide et impera.

Attributed to Philip II, King of Macedon


CHAPTER 11

Unearthing the Network Structure

You’re probably not going to be surprised that complex networks have...a
complex structure. From ancient times to modern days, researchers and
practitioners have confronted complexity by dividing a complex system into
smaller parts (constituents) and then taking a closer look at the parts and
making sense out of them. A part could be as small as a single network node
or as large as a so-called giant connected component (GCC). (You will meet
a GCC later on page 128; for now, it suffices to know that it is giant.) You need
a “network-o-scope” to zoom in and out, and you’re going to build it in this
chapter.

You will learn how to dissect an original complex network into constituents
of various sizes, shapes, and types: isolates, connected components, cliques,
communities, and k-cores, to mention but a few. You will understand the
function and place of each type of constituent in the network analysis
workflow. At the end, you will get some suggestions about naming the extracted
parts, because, as a rule of thumb, you cannot successfully use something
that does not have a name. In other words, you will be ready to divide and
conquer complex networks. (Which, sadly, did not save the author of these
words, King Philip II of Macedon, from an assassination.)

Locate Isolates 

The smallest distinct element of any network is an isolate: a node that is not
connected to any other node (an isolate can still be connected to itself with
a loop edge). Though isolates belong to a bigger network, their very existence
is against the networking spirit, because the whole idea of networking is that
of connectedness. An example of an isolate in a semantic network is a word
that has no synonyms, no homonyms, no antonyms, and no other relationships
to any other word (say, “sphygmomanometer” in a network of simple
synonyms, because it has none). An example of an isolate in a product network
is an item that nobody ever buys together with any other item. The last
meal comes to mind, but then, again, nobody pays for the last meal, so
technically it is not even a purchase.

As a network analyst, you want to identify isolates and research the reasons
for their isolation. Have you overlooked an edge while constructing the
network? (Go over the network construction process one more time.) Have you
replaced negative ties in a signed graph with zero-weight positive ties and
later discarded them? (Check if there is a better way to preserve negative ties.)
Have you sliced a weighted network too aggressively? (See Slice Weighted
Networks, on page 79, and select a lower slicing threshold, if possible.) If the
reasons seem to be valid, there is no need to keep the isolates in the network.
Locate them with nx.isolates(G), include their names or count in the final
report, and chop the isolates off:

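import networkx as nx

G = nx.Graph()
G.add_nodes_from("ABCD") # No edges -- all nodes are isolates
# list() is harmless on a list and required if nx.isolates() returns a generator
my_isolates = list(nx.isolates(G))

< ['D', 'C', 'B', 'A']

G.remove_nodes_from(my_isolates) # No more isolates!
my_isolates = list(nx.isolates(G))

< []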
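If you want to see what locating isolates amounts to, here is a minimal sketch in plain Python, with no NetworkX involved. The adjacency dictionary and the helper name find_isolates are made up for illustration. The sketch follows the definition given above, so a node whose only edge is a loop still counts as an isolate; NetworkX, which goes by node degree, may not report such a node.

```python
def find_isolates(adjacency):
    """Return the nodes that have no neighbors other than themselves."""
    return sorted(node for node, neighbors in adjacency.items()
                  if not (set(neighbors) - {node}))

# Hypothetical toy network: node -> set of neighbors
adjacency = {
    "A": {"B"},
    "B": {"A"},
    "C": set(),   # no edges at all
    "D": {"D"},   # only a loop edge
}

isolates = find_isolates(adjacency)
print(isolates)

# Chop the isolates off, keeping everything else intact
trimmed = {node: neighbors for node, neighbors in adjacency.items()
           if node not in isolates}
print(sorted(trimmed))
```

Keeping the list of isolates around before trimming makes it easy to include their names or count in the final report.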

Split Networks into Connected Components 

A connected component is a subset of network nodes such that there exists
a path (Think in Terms of Paths, on page 88) from each node in the subset to
any other node in the same subset. An isolate is a special case of a connected
component: there is only one node in the subset, so no path is even needed!
The figure on page 127 shows a network with three connected components: a
larger component A, smaller component B, and isolate C. A fictitious network
traveler can get from any node in A to any other node in A, but not to any
node in B.
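The definition translates almost literally into a breadth-first search: start from any unvisited node, collect everything you can reach, and call the result a component. The following is a rough sketch in plain Python, not the way NetworkX does it. The edge list mirrors the three-component layout described above with edge directions ignored, and the function name is an assumption of this example.

```python
from collections import deque

def connected_components(adjacency):
    """Yield the connected components of an undirected graph as sets."""
    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        yield component

# Undirected edges of the larger component, the smaller one, and isolate C
edges = [("A", "a0"), ("a0", "a1"), ("a1", "a2"), ("a1", "a3"), ("a3", "A"),
         ("B", "b0"), ("b0", "b1"), ("b1", "B")]
adjacency = {"C": set()}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

components = list(connected_components(adjacency))
print(sorted(len(c) for c in components))
```

The isolate comes out as a component of size one, which is exactly the special case mentioned above.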

If a network is directed, it may have weakly and strongly connected
components. In a strongly connected component, there is always a directed path
from any node of the component to any other node of the same component.
In a weakly connected component, you are allowed to travel one-way streets
in the wrong direction (drive responsibly!), if this is what it takes to get from
the source to the destination. Both components A and B in the figure are
weakly connected, but B is also strongly connected. (No nodes are reachable
from a2!)

NetworkX provides two families of functions for component analysis. Let’s
experiment with the network from the previous figure:

make-figures.py
F = nx.DiGraph()
F.add_node("C")
F.add_edges_from([("B", "b0"), ("b0", "b1"), ("b1", "B")])
F.add_edges_from([("A", "a0"), ("a0", "a1"), ("a1", "a2"), ("a1", "a3"),
                  ("a3", "A")])

Functions nx.connected_components(G) (implemented only for undirected
networks), nx.strongly_connected_components(F), and
nx.weakly_connected_components(F) (both implemented only for directed
networks) take a network of the appropriate type as the parameter and return
a generator of sets of nodes that belong to the namesake components. You can
coerce the generator to a list, if necessary:

list(nx.weakly_connected_components(F))

< [{'A', 'a0', 'a2', 'a1', 'a3'}, {'B', 'b0', 'b1'}, {'C'}]

list(nx.strongly_connected_components(F))

< [{'a2'}, {'A', 'a0', 'a1', 'a3'}, {'B', 'b0', 'b1'}, {'C'}]

G = nx.Graph(F)
list(nx.connected_components(G))

< [{'A', 'a0', 'a2', 'a1', 'a3'}, {'B', 'b0', 'b1'}, {'C'}]

Note how we convert the directed graph F into an undirected graph G in the last
expression. The connected components of the converted graph are the same
as the weakly connected components of the original graph. You can use the
obtained node sets to extract the respective subgraphs from the original graph:

wcc = nx.subgraph(F, list(nx.weakly_connected_components(F))[1]) 
len(wcc) 

< 3
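To get a feel for why a2 ends up alone in the strongly connected case, you can compute strong components by brute force: two nodes belong together if each can reach the other along directed edges. The sketch below is plain Python and deliberately naive; the helper names are assumptions, and real libraries use much faster algorithms.

```python
def reachable(successors, start):
    """Return every node reachable from start along the given edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in successors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def strong_components(nodes, edges):
    """Group nodes that can reach each other in both directions."""
    successors, predecessors = {}, {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)
        predecessors.setdefault(v, set()).add(u)
    components, assigned = [], set()
    for node in nodes:
        if node in assigned:
            continue
        # Nodes that this node can reach AND that can reach this node
        component = reachable(successors, node) & reachable(predecessors, node)
        components.append(component)
        assigned |= component
    return components

nodes = ["A", "a0", "a1", "a2", "a3", "B", "b0", "b1", "C"]
edges = [("B", "b0"), ("b0", "b1"), ("b1", "B"),
         ("A", "a0"), ("a0", "a1"), ("a1", "a2"), ("a1", "a3"), ("a3", "A")]
scc = strong_components(nodes, edges)
print(scc)
```

Node a2 has incoming edges but no outgoing ones, so nothing it "reaches" can reach it back, and it forms a component of its own.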

Alternatively, you can use the other family of functions:

• nx.connected_component_subgraphs(G) (implemented only for undirected networks)

• nx.strongly_connected_component_subgraphs(F)

• nx.weakly_connected_component_subgraphs(F) (the latter two functions are
implemented only for directed networks)

These functions take a network as the parameter and return a generator of
Graph or DiGraph objects, depending on the original network type. So, if you only
want to know which nodes belong to what component, the functions without
the subgraphs suffix may save you some time. If you have further operations
in mind, go with the second family.

Most co-occurrence networks, by construction, are undirected. Indeed, the
fact that items A and B are in the same place at the same time implies that
A is in the same place with B and the other way around. That’s why for the
rest of the chapter, let’s assume that all networks are undirected and all
connected components are just connected components, without any references
to their strength or weakness.

One of the components in a complex network often dominates the others: not
so much because it is strong, but because it is giant. The giant connected
component (GCC) is simply the largest component by the node count. NetworkX
does not provide a function for extracting the GCC, but you can still find it
by calling one of the functions mentioned previously, reverse sorting the
generated list by size, and taking the first element:

comp_gen = nx.connected_components(G) 

gcc = sorted(comp_gen, key=len, reverse=True)[0] 
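If all you need is the largest component, a full sort is more than necessary: Python's built-in max() with key=len does the same job in one pass. The sketch below works on a hand-written list of node sets (copied from the component listing earlier in the chapter) rather than on a live graph, and also reports what share of the network the GCC covers.

```python
# Node sets as a component function might return them
components = [{"A", "a0", "a1", "a2", "a3"}, {"B", "b0", "b1"}, {"C"}]

# The largest component by node count
gcc = max(components, key=len)

# Fraction of all nodes that belong to the GCC
share = len(gcc) / sum(len(c) for c in components)
print(sorted(gcc), round(share, 2))
```

In this toy network the "giant" component is not very giant; in real complex networks the share is typically much higher.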

The size of a GCC in complex networks typically ranges between 80 and 100
percent of the full network size. You may want to treat each smaller component
as an indivisible structural unit of the network and focus your attention on
the GCC. As a bonus, you will spare yourself from remembering which
algorithms and functions apply to disconnected networks and which do not.

What Makes Components Giant?

According to Albert-László Barabási, a prominent complex network researcher,
most complex networks evolve as a result of preferential attachment.
Preferential attachment (also known as the “rich get richer,” “80/20,” or the
Pareto principle) suggests that when a new node joins a network, it is likely
to attach itself to a node with the highest degree. Thus, the degree of the
node with the highest degree becomes even higher, and the connected component
that contains that node grows faster than all other connected components,
leading to the emergence of the GCC. The converse is also true: if a network
has a GCC, it is likely a result of preferential attachment.
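A toy simulation makes the "rich get richer" effect tangible. The sketch below is a simplified model of my own choosing, not a reproduction of any published algorithm: every new node brings exactly one edge and picks its target with a probability proportional to the target's current degree. Because each newcomer attaches to the existing network, this particular model always yields a single component, so it illustrates the growth of hubs rather than the competition between components. The function name and the seed are arbitrary.

```python
import random

def grow_network(n_nodes, seed=42):
    """Grow a network one node at a time, favoring high-degree targets."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every node is listed once per incident edge, so a uniform draw from
    # this list selects a node with probability proportional to its degree
    endpoints = [0, 1]
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints += [new_node, target]
    return edges

edges = grow_network(1000)

degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

# A handful of early nodes usually collects a large share of the edges
print(sorted(degree.values(), reverse=True)[:5])
```

Try different seeds and network sizes: the identity of the hubs changes, but a few very well connected nodes tend to appear every time.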

Separate Cores, Shells, Coronas, and Crusts

The only valuable property of a connected component is its connectedness.
There is always a way to get from any node A in a component to any other
node B in the same component. The property of connectedness is global and,
while important for social and communication networks (where paths are
responsible for information diffusion), may not be adequate for semantic,
product, and other types of networks, where direct or short-haul connections
are more essential. Consider a network of synonyms: “emerald” is a synonym
of “green,” and “green” is a synonym of “ecological,” but “ecological” is hardly
a synonym of “emerald.”

Let’s zoom into a connected component (say, into the GCC) and try to find more
elements inside.

One of the fundamental tools in modern sociology is core-peripheral analysis.
A social network, thereby, consists of two sets of nodes: the core (the nodes
that are more or less tightly interconnected) and the periphery (the nodes
that are tightly connected to the core, but only weakly, if at all, connected to
the other peripheral nodes). The graphs of core-peripheral networks often
have a “hairy” appearance: their dense “body” is adorned with “pendulums,”
multi-edge self-loops, and so on.

Social networks are not the only networks known to have the core-periphery
structure. As another example, networks of journal citations reveal the same
pattern, where the core consists of papers published in prominent journals
in the field, and papers published in marginal journals populate the periphery.

Traditional social network analysis attacks the core-periphery decomposition
via fuzzily defined blockmodeling. We, on the other hand, will start by
introducing a more mathematically rigid classification of nodes into four
categories: cores, shells, coronas, and crusts.

A core or, more accurately, a k-core (where k could be any non-negative integer
number) is a subgraph of the original network graph such that each node in
the subgraph has at least k neighbors. A 0-core is, naturally, the whole graph.
A 1-core is a graph with no isolates. A 2-core is a graph where no node has
fewer than two neighbors (no node is a part of a pendulum). Any graph usually
has more than one core; the core with the largest possible k is called the
main core. A k-core construction process is iterative:

1. Start with the original graph and remove all nodes that have a degree
smaller than k and all the incident edges; this will probably result in some
of the remaining nodes losing their neighbors and their degree decreasing.

2. Some nodes that have k neighbors or more may have fewer than k
neighbors after trimming; remove them, too, and iterate until no remaining
node has fewer than k neighbors.

3. The remaining nodes form the k-core.
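The three steps above can be sketched in a dozen lines of plain Python. This is an illustration of the peeling idea under simple assumptions (an undirected graph stored as a dictionary of neighbor sets, a made-up five-node example), not the implementation that NetworkX uses.

```python
def k_core(adjacency, k):
    """Return the node set of the k-core of an undirected graph."""
    # Work on a copy so that the caller's graph stays intact
    remaining = {node: set(nbrs) for node, nbrs in adjacency.items()}
    while True:
        # Steps 1 and 2: find the nodes that currently have too few neighbors
        weak = [node for node, nbrs in remaining.items() if len(nbrs) < k]
        if not weak:
            # Step 3: whatever is left is the k-core
            return set(remaining)
        for node in weak:
            for neighbor in remaining.pop(node):
                if neighbor in remaining:
                    remaining[neighbor].discard(node)

# Hypothetical toy graph: a triangle a-b-c with a tail c-d-e
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

print(sorted(k_core(adjacency, 1)))
print(sorted(k_core(adjacency, 2)))
print(sorted(k_core(adjacency, 3)))
```

Note how removing e in the 2-core computation leaves d with a single neighbor, so d has to go in the next round: that is the iteration described in step 2.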

A k-crust is what is left of the original network when we remove the k-core.
In other words, the crust is the periphery.

A core has its internal structure. The subgraph of the k-core in which all
nodes have exactly k neighbors in the core is called a k-corona. Unlike crusts,
coronas are not necessarily connected and may consist of unconnected
components (that is, unconnected within the corona).

Finally, a subset of nodes in the k-core but not in the (k+1)-core is called a
k-shell. Just like a corona, a shell may consist of components that are not
connected within the shell. Let’s experiment with the graph from the following
figure.

make-figures.py
G = nx.Graph(
    (("Alpha", "Bravo"), ("Bravo", "Charlie"), ("Charlie", "Delta"),
     ("Charlie", "Echo"), ("Charlie", "Foxtrot"), ("Delta", "Echo"),
     ("Delta", "Foxtrot"), ("Echo", "Foxtrot"), ("Echo", "Golf"),
     ("Echo", "Hotel"), ("Foxtrot", "Golf"), ("Foxtrot", "Hotel"),
     ("Delta", "Hotel"), ("Golf", "Hotel"), ("Delta", "India"),
     ("Charlie", "India"), ("India", "Juliet"), ("Golf", "Kilo"),
     ("Alpha", "Kilo"), ("Bravo", "Lima")))

NetworkX provides a useful collection of functions for calculating all k things.
Each of them takes a graph and k as the parameters and returns the namesake
core-periphery element (k is optional for nx.k_shell(), nx.k_crust(), and
nx.k_core(): they return the main shell, crust, and core by default):

nx.k_core(G).nodes() # Round and square nodes and shaded edges

< ['Golf', 'Charlie', 'Delta', 'Hotel', 'Foxtrot', 'Echo']

nx.k_crust(G).nodes() # Triangular nodes and shaded edges

< ['Lima', 'Bravo', 'Kilo', 'Juliet', 'Alpha', 'India']

nx.k_shell(G).nodes() # Round and square nodes and shaded edges

< ['Golf', 'Charlie', 'Delta', 'Hotel', 'Foxtrot', 'Echo']

nx.k_corona(G, k=3).nodes() # Square nodes

< ['Golf', 'Charlie']
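You can cross-check these results against the definitions without NetworkX. The sketch below rebuilds the same twelve-node graph as a dictionary of neighbor sets, repeats a small peeling helper so that the block runs on its own, and derives the main core, the crust (as defined in this chapter: everything outside the main core), the 3-corona, and the 2-shell. All helper names are assumptions of this example.

```python
EDGES = [
    ("Alpha", "Bravo"), ("Bravo", "Charlie"), ("Charlie", "Delta"),
    ("Charlie", "Echo"), ("Charlie", "Foxtrot"), ("Delta", "Echo"),
    ("Delta", "Foxtrot"), ("Echo", "Foxtrot"), ("Echo", "Golf"),
    ("Echo", "Hotel"), ("Foxtrot", "Golf"), ("Foxtrot", "Hotel"),
    ("Delta", "Hotel"), ("Golf", "Hotel"), ("Delta", "India"),
    ("Charlie", "India"), ("India", "Juliet"), ("Golf", "Kilo"),
    ("Alpha", "Kilo"), ("Bravo", "Lima"),
]

def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbor sets."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def k_core(adjacency, k):
    """Peel off nodes with fewer than k neighbors until none are left."""
    remaining = {node: set(nbrs) for node, nbrs in adjacency.items()}
    while True:
        weak = [node for node, nbrs in remaining.items() if len(nbrs) < k]
        if not weak:
            return set(remaining)
        for node in weak:
            for neighbor in remaining.pop(node):
                if neighbor in remaining:
                    remaining[neighbor].discard(node)

adjacency = build_adjacency(EDGES)

# Main core: raise k for as long as the next core is not empty
k = 0
while k_core(adjacency, k + 1):
    k += 1
main_core = k_core(adjacency, k)

# Crust: what is left when the main core is removed
crust = set(adjacency) - main_core

# Corona: core nodes with exactly k neighbors inside the core
corona = {n for n in main_core if len(adjacency[n] & main_core) == k}

# 2-shell: in the 2-core but not in the 3-core
shell_2 = k_core(adjacency, 2) - k_core(adjacency, 3)

print(k, sorted(main_core))
print(sorted(crust))
print(sorted(corona))
print(sorted(shell_2))
```

The main core, crust, and corona agree with the node lists shown above; the 2-shell is an extra that the NetworkX session does not print.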

Extract Cliques

Unlike the smaller components, the GCC and the k-cores are usually too large
to be considered a single structural element. Depending on your interpretation
of the nodes and edges, you should zoom in even further in a search for
smaller network building blocks, such as cliques.

A clique, or, more accurately, a k-clique is a subset of k nodes such that each
node is directly connected to each other node in the clique. (We distinguish
weak and strong cliques in directed graphs.) Cliques are also known as
complete subgraphs. The nodes in a clique may be connected to other nodes
as well, but they do not have to (that is, the degree of a node in a k-clique is
at least k-1). The principal difference between cliques and connected
components is that the path between any two nodes in a clique must have the
length of 1, while in a component, the path length is limited only by the graph
diameter (Think in Terms of Paths, on page 88).

Any single node is a 1-clique, a monad. Any two connected nodes form a
2-clique, a dyad. A triangle of nodes (the result of transitive closure) is a
3-clique, a triad (Explore Neighborhoods, on page 84). Monads, dyads, and triads
are very common in complex networks.
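Checking whether a given set of nodes is a clique is cheap: every pair of nodes in the set must be adjacent. Here is a small sketch in plain Python; the toy graph (a triangle with one extra node hanging off it) and the function name is_clique are made up for the occasion.

```python
from itertools import combinations

def is_clique(adjacency, nodes):
    """True if every pair of the given nodes is directly connected."""
    return all(v in adjacency[u] for u, v in combinations(nodes, 2))

# Hypothetical toy graph: triangle a-b-c plus a pendant node d
adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

print(is_clique(adjacency, ["a"]))                 # a monad
print(is_clique(adjacency, ["c", "d"]))            # a dyad
print(is_clique(adjacency, ["a", "b", "c"]))       # a triad
print(is_clique(adjacency, ["a", "b", "c", "d"]))  # d is not adjacent to a and b
```

Verifying a clique is easy; the hard part, as you will see shortly, is finding the large ones.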

A maximal clique is a k-clique that cannot be made a (k+1)-clique by adding
another node to it. For example, clique (Alpha, Bravo, ..., Echo) in the following
figure is a maximal clique, because including any other node (Foxtrot, Golf,
or Hotel) into it invalidates the complete connectedness property. For example,
if Foxtrot is included, then (Alpha, Bravo, ..., Foxtrot) is not a clique anymore.
The largest maximal clique in a network graph is called the maximum clique.
(No, I did not invent this terminology!)

Let’s experiment with the graph from the following figure:



make-figures.py
# Generate a 5-clique
G = nx.complete_graph(5, nx.Graph())
nx.relabel_nodes(G,
    dict(enumerate(("Alpha", "Bravo", "Charlie", "Delta", "Echo"))),
    copy=False)
# Attach a pigtail to it
G.add_edges_from([
    ("Echo", "Foxtrot"), ("Foxtrot", "Golf"), ("Foxtrot", "Hotel"),
    ("Golf", "Hotel")])

NetworkX provides function nx.find_cliques() for finding all maximal cliques in a
graph (the largest of which is the maximum clique). The function returns a
list generator, and this time, the use of a generator is not a tribute to
Pythonic programming style, but a dire necessity. As a matter of fact, larger
cliques, especially maximal and maximum cliques, are rare and hard to
find. Finding large cliques is a computationally very hard problem (known
as an NP-complete problem) and listing all large cliques requires exponential
time. Unfortunately, function nx.find_cliques() generates cliques in a random
order, but if you want to get only some maximal cliques, not all of them,
then you can stop the generator whenever you want. The following code
finds all three maximal cliques from the figure (I highlighted the larger two
in the figure):

list(nx.find_cliques(G))

< [['Golf', 'Hotel', 'Foxtrot'], ['Echo', 'Alpha', 'Bravo',
   'Charlie', 'Delta'], ['Echo', 'Foxtrot']]
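If you are curious how maximal cliques can be enumerated lazily, the classic textbook approach is the Bron-Kerbosch algorithm. The sketch below is its most basic form (no pivoting, plain recursion), written as a generator so that you can stop whenever you have seen enough. It is meant as an illustration only and is not claimed to match what NetworkX does internally; the helper names are my own.

```python
from itertools import combinations

def maximal_cliques(adjacency):
    """Yield every maximal clique (basic Bron-Kerbosch, no pivoting)."""
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            # Nothing can be added and nothing was skipped: clique is maximal
            yield clique
            return
        for node in list(candidates):
            yield from expand(clique | {node},
                              candidates & adjacency[node],
                              excluded & adjacency[node])
            # The node has been fully explored; never try it again here
            candidates = candidates - {node}
            excluded = excluded | {node}
    yield from expand(set(), set(adjacency), set())

# The 5-clique with a pigtail from the figure
clique = ["Alpha", "Bravo", "Charlie", "Delta", "Echo"]
pigtail = [("Echo", "Foxtrot"), ("Foxtrot", "Golf"), ("Foxtrot", "Hotel"),
           ("Golf", "Hotel")]
adjacency = {}
for u, v in list(combinations(clique, 2)) + pigtail:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

found = {frozenset(c) for c in maximal_cliques(adjacency)}
for c in sorted(found, key=len, reverse=True):
    print(sorted(c))

# Stop after the first maximal clique, whichever it happens to be
first = next(maximal_cliques(adjacency))
```

The order in which cliques come out depends on set iteration order, which is why the sketch collects them into a set before printing.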

You have at least two good reasons to search a network for k-cliques: a
theoretical and an empirical one. In the theoretical case, you may already have
some prior knowledge about the network structure. For example, a marketing
specialist may define a project basket in a product network as a collection
of products such that they are always purchased together (and, therefore,
form a clique when represented as a network). Recognizing k-cliques in a
product network almost instantly leads you to the discovery of project
baskets. Closely cooperating teams in social and organizational networks are
k-cliques, and so are collections of complete synonyms in semantic
networks.

In the empirical case, you use cliques as opaque network atoms. If you
assume that an edge between two nodes is an indication of their significant
similarity, then a complete connectedness within a clique implies overall
significant similarity of the member nodes. Thus, you can replace all k nodes
with one node that represents the entire clique, or with a newly minted
“clique-node,” potentially significantly simplifying the network topology.
Function nx.make_max_clique_graph() generates a new graph by replacing each
maximal clique with a new synthetic node:

synthetic = nx.make_max_clique_graph(G) 
synthetic.edges() 

< [(1, 3), (2, 3)] 
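Where do these two edges come from? One plausible reading, which the sketch below assumes, is that two synthetic nodes are connected whenever the corresponding maximal cliques share at least one original node. The clique list is copied from the output shown earlier and numbered from 1 to match the listing; a different version of the library may number the synthetic nodes differently.

```python
from itertools import combinations

# Maximal cliques in the order reported earlier
cliques = [
    {"Golf", "Hotel", "Foxtrot"},
    {"Echo", "Alpha", "Bravo", "Charlie", "Delta"},
    {"Echo", "Foxtrot"},
]

# Connect two synthetic nodes if their cliques overlap
edges = [(i, j)
         for (i, a), (j, b) in combinations(enumerate(cliques, start=1), 2)
         if a & b]
print(edges)
```

The small clique (Echo, Foxtrot) overlaps with both larger cliques, while the larger two share nothing, which is why the synthetic graph is a simple chain.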

Naturally, you can replace cliques with synthetic nodes in the a priori case,
too! Just be aware that this function first finds all maximal cliques, with all
the NP-completeness implications of finding all maximal cliques.

Recognize Clique Communities

By definition, a clique is a very rigid and sensitive network structure. Removing
an edge from a k-clique transforms it into two interwound, partially
overlapping, adjacent (k-1)-cliques. In the figure on page 135, subgraphs Alpha-Delta
and Alpha-Charlie, Echo are 4-cliques each, but the whole network is not a
5-clique, because the edge Delta-Echo is missing (I show it as a dashed line).

Intuitively, you may feel that all k nodes somehow belong together and the
missing edge could have been a victim of a measurement or data entry error
or improper slicing (Slice Weighted Networks, on page 79) or conversion from
a directed graph. Nonetheless, nx.find_cliques() will not recognize your k nodes
as a clique (because they are not). Instead, it will report two smaller cliques,
leaving it up to you to notice that they actually share k-1 nodes and their
separation may have been caused by a missing edge.

Luckily, NetworkX supports k-clique communities. A k-clique community is a
union of all k-cliques that can be reached through adjacent k-cliques. The
process of reaching all cliques in the union is called clique percolation [PDFV05].
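Clique percolation is easier to grasp once you see it spelled out. The sketch below is a brute-force illustration in plain Python, suitable only for tiny graphs: it lists all k-cliques by trying every combination of k nodes, treats two k-cliques as adjacent when they share k-1 nodes, and merges everything reachable through adjacent cliques. The graph is the five-node network with the missing Delta-Echo edge described above; the function name percolate_cliques is an assumption.

```python
from itertools import combinations

def percolate_cliques(adjacency, k):
    """Return k-clique communities as frozensets (brute force)."""
    cliques = [frozenset(c) for c in combinations(sorted(adjacency), k)
               if all(v in adjacency[u] for u, v in combinations(c, 2))]
    communities = []
    unvisited = set(cliques)
    while unvisited:
        start = unvisited.pop()
        community, frontier = set(start), [start]
        while frontier:
            current = frontier.pop()
            # Adjacent k-cliques share exactly k-1 nodes with the current one
            adjacent = [c for c in unvisited if len(c & current) == k - 1]
            for clique in adjacent:
                unvisited.discard(clique)
                community |= clique
                frontier.append(clique)
        communities.append(frozenset(community))
    return communities

# A would-be 5-clique with the Delta-Echo edge missing
nodes = ["Alpha", "Bravo", "Charlie", "Delta", "Echo"]
adjacency = {node: set() for node in nodes}
for u, v in combinations(nodes, 2):
    if {u, v} != {"Delta", "Echo"}:
        adjacency[u].add(v)
        adjacency[v].add(u)

print(percolate_cliques(adjacency, 5))  # no 5-clique, hence no community
print(percolate_cliques(adjacency, 4))  # two adjacent 4-cliques merge
```

With k=4, the two 4-cliques share three nodes, so they percolate into a single community that contains all five nodes, which is exactly the grouping your intuition suggested.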



When a Community Is Not a Community and a Cluster Is Not a Cluster

Anthropologists and social scientists have a different idea of a
community and may get easily confused at this point. For them,
a community is a tightly knit group of people, not a group of
abstract nodes. To avoid getting into pointless terminological fights,
I will sometimes refer to communities as clusters. Unfortunately,
data scientists who may be reading this book, too, have a different
definition of a cluster. Apparently, when it comes to network
analysis, terminological fights are unavoidable.


K-clique communities in complex networks are a curse and a blessing. Why
are clique communities a blessing?

They provide a flexible substitute for inflexible proper cliques. True, the nodes
in a community are not in general directly interconnected; however, if the
relationship represented by the edges is actually transitive (if A is adjacent
to B, and B is adjacent to C, then A is supposed to be adjacent to C, as it
would be in the case of two products always purchased together) and the
missing edges result from network construction imperfection, then the
length of the path between any two nodes in a clique community does not
really matter.


Why are they a curse, then?

Just like strict cliques, clique communities do not necessarily partition the
network and can overlap with other clique communities. In other words, the
same node may belong to more than one clique or clique community. This
may or may not be what you want.

What’s worse, if the relationship represented by the edges is only
approximately transitive (if A is adjacent to B, and B is adjacent to C, then A is
not necessarily adjacent to C, as it would be in the case of personal friendship),
then two nodes separated by a multi-edge path may actually have little or nothing
shared, and their membership in the same clique community would be
questionable.

Only you can determine whether clique communities are appropriate for your
network. But once you do, here’s the function for doing the dirty job:

list(nx.k_clique_communities(G, k=3))

< [frozenset({'Golf', 'Hotel', 'Foxtrot'}),
   frozenset({'Alpha', 'Charlie', 'Bravo', 'Echo', 'Delta'})]

Note that a k-clique community always has at least k nodes!
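The communities come back as frozenset objects. A short sketch of what that means in practice: a frozenset can serve as a dictionary key (handy for attaching labels to communities), it has no methods for modification, and it can be turned into a regular set when you do need to edit it. The labels and the extra node below are invented for the example.

```python
communities = [
    frozenset({"Golf", "Hotel", "Foxtrot"}),
    frozenset({"Alpha", "Charlie", "Bravo", "Echo", "Delta"}),
]

# Immutable sets are hashable, so they can be used as dictionary keys
labels = {communities[0]: "pigtail", communities[1]: "5-clique"}
print(labels[frozenset({"Foxtrot", "Golf", "Hotel"})])

# A frozenset offers no add(); make a mutable copy to modify it
editable = set(communities[0])
editable.add("India")  # hypothetical extra member
print(sorted(editable))
```

The original community stays untouched, because set() creates an independent copy.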

Frozen Sets

A frozenset is an immutable version of a Python set. Because of its
immutability, it can be used as a key in a dictionary, but it can be
cast to a set, if any modifications are necessary.

Outline Modularity-Based Communities

The fuzziest and most flexible form of node organization in
a complex network is network communities based on modularity.
They are sometimes also called clusters or groups,
and are not to be confused with clique communities (Recognize
Clique Communities, on page 134).

This section uses community.

Let’s start with modularity first, and assume that the network has been
already partitioned into non-overlapping communities (later you’ll figure out
how). According to Newman’s definition [New06], modularity m is the fraction
of the edges that fall within the given communities minus the expected fraction
if edges were distributed at random, while conserving the node degrees. The
value of m is in the range from -0.5 (inclusive) to 1 (exclusive). If most of the
edges are incident to the nodes within the same community, the modularity
is very high, close (but not equal) to 1, and the proposed partition describes
a very good community structure. The modularity of -0.5 means that the
nodes within the same community are not adjacent at all: the proposed
community structure is worse than random; in fact, you are probably dealing
with anti-communities that induce bi- and multipartite networks (as in Project
Bipartite Networks, on page 178).


Ideally, you want to partition a network in such a way that the modularity is
as high as possible. The modularity of 0.6 and above corresponds to networks
that have a clearly visible community structure. Unfortunately, getting the
largest modularity is hard for at least three reasons:

• The problem of optimal partitioning is NP-complete with respect to the
network size. To find the best partition, you should calculate modularity
for every possible partition and then select the best one. The number of
partitions is simply too large, and the problem does not have a feasible
exact solution for any non-trivial graph.

• Approximate solutions (for example, the most popular Louvain algorithm
[BGLL08]) do a pretty good job, but some of them are probabilistic, which
means every time you run them, you may end up having slightly different
partitions.

• The resolution of modularity-based methods scales poorly, and they
overlook small communities in large networks. A plausible solution is to
partition the network recursively into smaller and smaller communities.


Anaconda, the most popular Python distribution, does not currently include
tools for modularity-based community detection. Fortunately, the tool exists
and can be easily installed via pip. It is called python-louvain. The externally
visible name of the module is community, and you import it under this name.

Module community uses the Louvain algorithm that optimizes network
modularity. The discovered communities are represented as a partition: a dictionary
with node labels as keys and integer community identifiers as values. The
module also calculates the modularity of the partition with respect to the
original network.

part = community.best_partition(G)

< {'Golf': 0, 'Bravo': 1, 'Delta': 1, 'Hotel': 0, 'Foxtrot': 0,
   'Charlie': 1, 'Alpha': 1, 'Echo': 1}

community.modularity(part, G)

< 0.3035714285714286
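You can verify the reported modularity by hand. For an unweighted network with m edges, each community contributes the fraction of edges that fall inside it minus the squared fraction of edge ends that it owns. The sketch below applies this formula, in plain Python, to the 5-clique with a pigtail and to the partition shown above. The function takes an edge list rather than a graph object, which is a simplification made for this example.

```python
from itertools import combinations

clique = ["Alpha", "Bravo", "Charlie", "Delta", "Echo"]
edges = list(combinations(clique, 2)) + [
    ("Echo", "Foxtrot"), ("Foxtrot", "Golf"), ("Foxtrot", "Hotel"),
    ("Golf", "Hotel")]
part = {"Golf": 0, "Bravo": 1, "Delta": 1, "Hotel": 0, "Foxtrot": 0,
        "Charlie": 1, "Alpha": 1, "Echo": 1}

def modularity(part, edges):
    """Modularity of a partition of an unweighted, undirected network."""
    m = len(edges)
    inside = {}  # community -> number of edges with both ends inside
    degree = {}  # community -> sum of the degrees of its members
    for u, v in edges:
        for node in (u, v):
            degree[part[node]] = degree.get(part[node], 0) + 1
        if part[u] == part[v]:
            inside[part[u]] = inside.get(part[u], 0) + 1
    return sum(inside.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
               for c in degree)

q = modularity(part, edges)
print(round(q, 6))
```

Community 1 holds 10 of the 14 edges and 21 of the 28 edge ends; community 0 holds 3 edges and 7 edge ends. The result agrees with the value reported above up to floating-point rounding.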

A Faster Way

There is a faster implementation of the Louvain algorithm provided
through the module louvain (must be installed separately, too).
However, it works only with graphs constructed in igraph, not in
NetworkX.

Just as with cliques and clique communities, you may want to replace the
smaller structural elements of a large network with synthetic nodes and build
an induced graph:

induced = community.induced_graph(part, G)
induced.nodes()

< [0, 1]

induced.edges(data=True)

< [(0, 0, {'weight': 3}), (0, 1, {'weight': 1}), (1, 1, {'weight': 10})]

Note that the induced graph is weighted and has loops. The weight of an
induced edge incident to the synthetic community nodes is the number of
edges in the original network that are incident to the nodes in the respective
communities.
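The weights of the induced graph are nothing more than edge counts per pair of communities, which a Counter can tally in one line. The sketch below recomputes them in plain Python for the 5-clique with a pigtail and the partition shown earlier (community 0 is the pigtail, community 1 is the 5-clique); it is a cross-check, not a replacement for the library function.

```python
from collections import Counter
from itertools import combinations

clique = ["Alpha", "Bravo", "Charlie", "Delta", "Echo"]
edges = list(combinations(clique, 2)) + [
    ("Echo", "Foxtrot"), ("Foxtrot", "Golf"), ("Foxtrot", "Hotel"),
    ("Golf", "Hotel")]
part = {"Golf": 0, "Bravo": 1, "Delta": 1, "Hotel": 0, "Foxtrot": 0,
        "Charlie": 1, "Alpha": 1, "Echo": 1}

# Count original edges per (community, community) pair; pairs are sorted
# so that (0, 1) and (1, 0) land in the same bucket
weights = Counter(tuple(sorted((part[u], part[v]))) for u, v in edges)
print(dict(weights))
```

The loops (0, 0) and (1, 1) count the edges inside each community, and the only edge between the communities is Echo-Foxtrot.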

Explore Modularity-Based Communities with Pandas

If you’re familiar with Pandas, you can convert a network partition part (a
dictionary) into a Series for the ease of further processing. You can see the
total number of communities, their sizes, and which nodes belong to which
community:

part_as_series = pd.Series(part)
part_as_series.sort_values()

< Foxtrot    0
  Golf       0
  Hotel      0
  Alpha      1
  Bravo      1
  Charlie    1
  Delta      1
  Echo       1
  dtype: int64

How big are the communities?

part_as_series.value_counts()

< 1    5
  0    3
  dtype: int64
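If Pandas is not at hand, the same questions can be answered with the standard library. The sketch below inverts the partition dictionary into a mapping from community identifiers to member sets; the variable names are made up. The resulting list of node sets is also a convenient starting point whenever a function expects a partition as a list of node collections.

```python
from collections import defaultdict

part = {"Golf": 0, "Bravo": 1, "Delta": 1, "Hotel": 0, "Foxtrot": 0,
        "Charlie": 1, "Alpha": 1, "Echo": 1}

# Invert node -> community into community -> set of nodes
blocks = defaultdict(set)
for node, community_id in part.items():
    blocks[community_id].add(node)

sizes = {cid: len(members) for cid, members in blocks.items()}
partition_as_list = [blocks[cid] for cid in sorted(blocks)]

print(sizes)
for members in partition_as_list:
    print(sorted(members))
```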

Perform Blockmodeling

The construction of the graph of maximal cliques or communities is a special
case of blockmodeling: grouping network nodes according to some meaningful
definition of equivalence and replacing them with synthetic “supernodes.” A
more general function nx.blockmodel(G, part) takes a graph G and its partition part
as a list of node collections (lists or sets), and creates an induced graph. Unlike
nx.make_max_clique_graph() and community.induced_graph(), nx.blockmodel() requires
that the partition include every node in the original graph exactly once. You can
manually remove the offending overlapping clique from a clique partition, if you
want:

cliques = list(nx.find_cliques(G))

< [['Golf', 'Hotel', 'Foxtrot'], ['Echo', 'Alpha', 'Bravo',
   'Charlie', 'Delta'], ['Echo', 'Foxtrot']] # not good!

synthetic = nx.blockmodel(G, [cliques[0], cliques[1]])
synthetic.edges()

< [(0, 1)]
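Before handing a list of blocks to a blockmodeling function, it may be worth checking that the list really is a partition, that is, every node shows up in exactly one block. Here is a minimal sketch of such a check in plain Python; the helper name is_partition is an assumption, and the clique lists are copied from the output above.

```python
def is_partition(nodes, blocks):
    """True if every node appears in exactly one block and nothing else does."""
    seen = [node for block in blocks for node in block]
    return len(seen) == len(set(seen)) and set(seen) == set(nodes)

nodes = ["Alpha", "Bravo", "Charlie", "Delta", "Echo",
         "Foxtrot", "Golf", "Hotel"]
cliques = [["Golf", "Hotel", "Foxtrot"],
           ["Echo", "Alpha", "Bravo", "Charlie", "Delta"],
           ["Echo", "Foxtrot"]]

print(is_partition(nodes, cliques))      # Echo and Foxtrot appear twice
print(is_partition(nodes, cliques[:2]))  # the third clique is dropped
```

Dropping the small overlapping clique is safe here only because its two nodes are already covered by the larger cliques; in general, removing a block may leave some nodes unassigned, and the check will tell you.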



Not All Blockmodeling Leads to the Same Rome

For social scientists, “blockmodeling” often means a very different
thing: separating a network into the core and periphery by way of
rearranging rows and columns of the incidence matrix. Blockmodeling
as understood by complex network analysts is a generalization
of the core-periphery decomposition.

Name Extracted Blocks

From the data scientific point of view, network analysis at the macroscopic level
(extraction of communities, cliques, and other structural blocks) is an example
of unsupervised machine learning. The goal of unsupervised machine learning is
to infer a network’s hidden structure in the absence of “labels”: node and edge
attributes (except, perhaps, the edge weights).

The unearthed blocks suffer from two major interrelated problems:

• It is not clear what they mean.

• They are nameless.

In fact, if you knew the purpose or nature of a block, you would give it a name,
and if you knew the name, you would guess what its purpose or nature is.

Selecting a good name for a block can be done in at least three ways.

• You can use your intelligence: look at the individual node labels and
generalize. A block that has labels “car,” “truck,” “train,” and “sled” probably
deserves to be called “land transportation,” and “hand,” “arm,” “leg,” “head,”
and “chest” belong to the block “body parts.” If unsure or confused, hire a
subject matter expert (SME) whose job is to know why nodes X and Y ended
up in the same block.

• Better yet, hire a lot of subject-matter experts, or sort-of-experts. Amazon
Mechanical Turk (AMT) [BKG11] offers a way to put any question to literally
thousands of people (“workers,” in the AMT terminology) for a very modest
price. Ask 10,000 AMT workers what “foos,” “bars,” and “foobars” have in
common. If the terms have anything in common at all, you will most probably
get an answer supported by a solid majority of workers.

• Finally, if you cannot hire an SME and would rather not mess with AMT, you
can still generate block labels from its data. If the nodes in a block differ (for
example, in size or weight), take the largest of them (say, “head”) and use its
label to synthesize the block label (for example, “the ‘head’ group”; see Interpret
the Results, on page 150 for a better example). If all nodes have the same
attributes or have no attributes at all, choose the first node in the alphabetic
order (“the ‘arm’ group”).
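The third, data-driven approach is easy to automate. The following sketch picks the largest node when the nodes differ in some numeric attribute and falls back to the alphabetically first node otherwise. The function name, the sizes dictionary, and the numbers in it are hypothetical and serve only to illustrate the rule.

```python
def name_block(nodes, sizes=None):
    """Synthesize a block label from the label of a representative node."""
    sizes = sizes or {}
    if len({sizes.get(node) for node in nodes}) > 1:
        # The nodes differ in size: the largest one represents the block
        representative = max(nodes, key=lambda node: sizes.get(node, 0))
    else:
        # No attributes, or all the same: the first node in alphabetic order
        representative = min(nodes)
    return "the '{}' group".format(representative)

block = ["hand", "arm", "leg", "head", "chest"]
sizes = {"hand": 2, "arm": 3, "leg": 4, "head": 7, "chest": 5}  # made-up sizes

print(name_block(block, sizes))
print(name_block(block))
```

Names produced this way are rarely elegant, but they are reproducible, and a mediocre name still beats no name at all.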

In this chapter, you learned how to dissect a complex network and explore its
anatomy. You know how to locate isolated nodes and components; identify cores,
shells, coronas, and crusts; and compute node cliques, clique communities, and
modularity-based communities (sometimes referred to as clusters). Now, it is time
to apply the freshly minted “network-o-scope” to a couple of real-world semantic
and product networks.

... Culture opens the sense of beauty.

Ralph Waldo Emerson, American essayist, lecturer,
and poet


CHAPTER 12

Case Study: Performing Cultural Domain Analysis

Cultural domain analysis (CDA, Analyzing Qualitative Data:
Systematic Approaches [BWR17]) is the study of how people
in groups think about lists of terms that somehow go
together and how this thinking differs between groups. Some
people associate “candle” with “Christmas,” others with “hurricane” and
“blackout,” and yet others with a “self-injury” (cutting/burning their skin)
toolset. Anthropologists, ethnographers, psychologists, and sociologists use
CDA to understand semantic mindscapes of social, ethnic, religious,
professional, and other groups. Before personal computers and specialized CDA
software became available, social scientists used to do CDA essentially by
hand. But not anymore! Python comes to the rescue.

You don’t have to be an anthropologist, ethnographer, psychologist, or
sociologist to read this chapter. Regardless of your background, you will learn how
to harvest semantic data from a popular blogging website and cache it locally
for further efficient access. You will see how to convert natural language units
into terms and build, analyze, and interpret a semantic network reflecting
the interests of the fans of The Good Wife, a CBS TV show. Hopefully, you will
be able to extend the same approach to other shows, other websites, and
other tag corpora.

The complete code for this case study is available in the file lj.py.¹

This chapter uses NLTK, Pandas, NumPy.

1. pragprog.com/titles/dzcnapy/source_code

Get the Terms

Start by importing all necessary modules and defining the domain (LJ
community) of interest. We suggest using Pandas and NumPy, the power tools of data
science, and NLTK (the Natural Language Toolkit) in this project, as well as
some other libraries, so you need to import them. (If you last used them a
while ago, you might want to blow the dust off your skill set [Zin16].)

lj.py

import urllib.request, os.path, pickle # Download and cache 
import nltk # Convert text to terms 

import networkx as nx, community # Build and analyze the network 
import pandas as pd, numpy as np # Data Science power tools 

Your next step is to get and cache term lists. A term is a unit of CDA. It can
be a word, a word group, a word stem, or even an emoticon (emoji). CDA looks
into similarities between terms that are shared among a reasonably homogeneous
group of people. So, you need to find a reasonably homogeneous group
of individuals, a list of terms, and a way to assess their similarity.

A great source of semantic data is LiveJournal (LJ), a collection of individual
and communal blogs with elements of a massive online social network.²
LiveJournal has an open, easily accessible API, and encourages the use of
public data for research. Unfortunately, LJ membership and activity peaked
in the early 2000s, but the site still hosts some lively blogging communities
(such as the celebrity gossip blog “Oh No They Didn’t!”³), and it serves rich
layers of historical data.

LJ communities consist of individual members that have their private blogs,
profiles, friend lists, and interests (online identity markers). In fact, LJ treats
communities and users as same-class citizens: communities, like individuals,
have their profiles, interests, and even “friends.” The URLs of user/community
profiles, interest lists, and friend lists have a regular structure. If
thegoodwife_cbs is the name of a community (you will use it in the rest of the
study), then thegoodwife-cbs.livejournal.com/profile/ is the URL of the community
profile (note that the underscore was replaced by a dash),
www.livejournal.com/misc/fdata.bml?user=thegoodwife_cbs&comm=1 is the friend
list, and www.livejournal.com/misc/interestdata.bml?user=thegoodwife_cbs is the
interest list.
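The URL pattern described above can be captured in a few lines. This is a minimal sketch: the helper name lj_urls and its base parameter are made up for illustration, and the URL templates are taken from the text rather than from any official LiveJournal documentation.

```python
def lj_urls(name, base="www.livejournal.com/misc"):
    """Build the three LiveJournal URLs described in the text for a community."""
    return {
        # The profile lives on a subdomain, where "_" becomes "-"
        "profile": "{}.livejournal.com/profile/".format(name.replace("_", "-")),
        "friends": "{}/fdata.bml?user={}&comm=1".format(base, name),
        "interests": "{}/interestdata.bml?user={}".format(base, name),
    }

urls = lj_urls("thegoodwife_cbs")
print(urls["profile"])
```

Only the profile URL needs the underscore-to-dash substitution; the two data URLs take the community name unchanged.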

2. www.livejournal.com

3. ohnotheydidnt.livejournal.com

For the purpose of this mini-study, let’s define two terms to be similar from
the perspective of an LJ community if they are consistently listed together on
different interest lists of the community members. Your first job is to obtain
and process the community membership list. A typical list looks like this:

# Note: Polite data miners cache on their end. Impolite ones get banned.
# Note: thegoodwife_cbs is a community account
P> emploding
P> poocat
«...more members...»
P< brooketiffany
P< harperjohnson

The first line of the document makes a significant point: if you download
something once, you should not download it again. Create the directory cache
and store all downloaded data in it. If you run CDA on the same data again,
it will hopefully still be in the cache, assuming that interest lists and
community membership are reasonably stable.

lj.py

LJ_BASE = "http://www.livejournal.com/misc"
DOMAIN_NAME = "thegoodwife_cbs"

cache_d = "cache/" + DOMAIN_NAME + ".pickle"
if not os.path.isfile(cache_d):
    # No cached copy yet: download the data and cache it
    domain = download(LJ_BASE, DOMAIN_NAME)
    if not os.path.isdir("cache"):
        os.mkdir("cache")
    with open(cache_d, "wb") as ofile:
        pickle.dump(domain, ofile)
else:
    # Reuse the cached copy
    with open(cache_d, "rb") as ifile:
        domain = pickle.load(ifile)

This code fragment uses the module pickle, the native Python data serializer
(Export and Import Networks, on page 30). If the cache file exists,
function pickle.load() deserializes the data object. Otherwise, the code calls
download(), creates the cache directory if necessary, and calls pickle.dump() to
serialize the data object domain and save it into the file. Note that you must
open the file in binary mode. The compressed cached pickle file is available as
thegoodwife_cbs.pickle.zip.⁴
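The same download-once pattern can be wrapped in a reusable helper. The sketch below is an illustration only: load_or_build and the stand-in function expensive are hypothetical names, and a temporary directory replaces the real cache directory so that the example leaves no files behind.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the object pickled at path; call build() and cache it if missing."""
    if os.path.isfile(path):
        with open(path, "rb") as ifile:      # binary mode is required
            return pickle.load(ifile)
    data = build()
    # Create the parent directory, if any, before writing
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as ofile:
        pickle.dump(data, ofile)
    return data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache", "demo.pickle")
    calls = []

    def expensive():
        # Stand-in for a slow download
        calls.append("download")
        return {"answer": 42}

    first = load_or_build(path, expensive)    # builds and caches
    second = load_or_build(path, expensive)   # served from the cache
```

The second call never reaches expensive(), which is the whole point of polite caching.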

The rest of the community membership list on page 143 is a two-column table.
The first column represents some subtle aspects of membership types (“P”
for individual users, “C” for communities, “<” for “friends,” “>” for
“friends-of”); the second column has usernames. To keep your code modular, write a
function download(base, domain_name) that takes care of this and similar tables.
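To make the table layout concrete, here is a minimal sketch that parses such a listing with plain string operations instead of Pandas. The function name parse_members is hypothetical, and the sample text is an abridged, partly invented version of the listing (the repeated username is there only to show deduplication).

```python
SAMPLE = """\
# Note: Polite data miners cache on their end. Impolite ones get banned.
# Note: thegoodwife_cbs is a community account
P> emploding
P> poocat
P< brooketiffany
P< emploding
"""

def parse_members(text):
    """Return a list of (direction, uid) pairs, skipping comments and blanks."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        direction, uid = line.split(maxsplit=1)
        pairs.append((direction, uid))
    return pairs

members = parse_members(SAMPLE)
# The same user may appear as both a "friend" and a "friend-of"
unique_uids = sorted({uid for _, uid in members})
```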


4. pragprog.com/titles/dzcnapy/source_code

lj.py

def download(base, domain_name):
    """
    Download interest data from the domain_name community on
    LiveJournal, convert interests to tags, create a domain DataFrame
    """
    members_url = "{}/fdata.bml?user={}&comm=1".format(base, domain_name) # ❶
    members = pd.read_table(members_url, sep=" ",
                            comment="#", names=("direction", "uid"))

    wnl = nltk.WordNetLemmatizer() # ❷
    stop = set(nltk.corpus.stopwords.words('english')) | set(('&'))

    term_vectors = [] # ❸
    for user in members.uid.unique(): # ❹
        print("Loading {}".format(user)) # Progress indicator
        user_url = "{}/interestdata.bml?user={}".format(base, user)
        try: # ❺
            with urllib.request.urlopen(user_url) as source:
                raw_interests = [line.decode().lower().strip()
                                 for line in source.readlines()]
        except:
            print("Could not open {}".format(user_url)) # Error message
            continue

        if raw_interests[0] == '! invalid user, or no interests':
            continue

        interests = [" ".join(wnl.lemmatize(w) # ❻
                              for w in nltk.wordpunct_tokenize(line)[2:]
                              if w not in stop)
                     for line in raw_interests
                     if line and line[0] != "#"]
        interests = set(interest for interest in interests if interest) # ❼
        term_vectors.append(pd.Series(index=interests, name=user).fillna(1)) # ❽

    return pd.DataFrame().join(term_vectors, how="outer").fillna(0)\
             .astype(int) # ❾

❶ Convert the membership table into a two-column DataFrame.

❷ Prepare a WordNet lemmatizer wnl (a tool for converting words into lemmas)
and a list of stopwords stop, extended to include “&.” You will need this
list to eliminate overly frequent words. Coerce the standard list of stopwords
into a set for faster lookup, because membership tests on Python lists
take time proportional to the list length.

Your next step is to download all interest lists, convert interests into terms
(they are not always the same!), and combine all term lists into a term matrix:
a matrix whose columns are term lists. An interest list looks very similar
to the membership list; even the reminder about caching is the same:



# Note: Polite data miners cache on their end. Impolite ones get banned.
# <intid> <intcount> <interest ...>
18576742 1 +5 sexterity
624552 7 a beautiful mess
18576716 1 any more hot chicks?
44870 28 seriously?
1638195 94 shiny!
«...more interests...»

This is still a table (showing interest ID, the system-wide rank of the interest,
and the actual interest), but the number of columns differs, depending on
how many words are in the interest description. Pandas DataFrames are poor
parsers for irregular texts; tackle the columns by hand, using low-level Python
tools. Note that if the username is not found or the user has not declared any
interests, the content is entirely different:

! invalid user, or no interests 
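Handling the irregular columns by hand can look like the following sketch. It is not the book's download() function: parse_interests is a hypothetical helper that splits off only the first two columns and treats a leading “!” as the marker of an invalid or empty account.

```python
def parse_interests(lines):
    """Return interest strings from an interest listing, or [] for an invalid user."""
    lines = [line.strip().lower() for line in lines]
    if lines and lines[0].startswith("!"):
        return []                      # "! invalid user, or no interests"
    interests = []
    for line in lines:
        if not line or line.startswith("#"):
            continue
        # Split off the first two columns only; the rest is the interest itself
        parts = line.split(maxsplit=2)
        if len(parts) == 3:
            interests.append(parts[2])
    return interests

sample = ["# <intid> <intcount> <interest ...>",
          "18576742 1 +5 sexterity",
          "624552 7 a beautiful mess",
          "44870 28 seriously?"]
parsed = parse_interests(sample)
```

Because split() is limited to two cuts, multiword interests survive intact regardless of how many words they contain.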

So, let’s continue the function code inspection:

❸ Set up an empty list accumulator term_vectors. At the end of the loop, it
becomes a list of term vectors, the raw material for the term matrix.

❹ Loop through all unique usernames in the community, because you need
the URLs of their interest lists.

❺ Obtain an interest list raw_interests as a Python list of strings for each distinct
community member in the try-except block. If the URL fails to open (for any
reason beyond your control), the script does not crash but politely informs
the programmer of the failure and proceeds to the next user. The same thing
happens when the user has no interests or does not exist at all. If everything
goes fine, decode each string and strip it of leading and trailing whitespace.
LJ interest lists are always in lowercase, but if they are not, a call to lower()
ensures that in the rest of the script you compare apples to apples.

❻ Split each non-empty, non-commented interest into individual words with
nltk.wordpunct_tokenize() and drop the first two tokens (the interest ID and
count). There may be more than one word in an interest
description, and some or all of these words may be forms of other words
(as in “chicks” and “chick”). Text analysis practitioners preach different ways
of handling word forms. Some suggest leaving the forms alone and treating
them as words on their own. Some advocate lemmatizing or stemming:
reducing a form to its lemma (the standard representation of the word)
or to the stem (the smallest meaningful part of the word to which affixes
can be attached). A lemmatizer reduces “programmers” to one “programmer,”
and a stemmer, depending on its zeal, yields a “programm” or even
a “program.” Let’s follow the lemmatizing crowd. Furthermore, almost
everyone agrees that certain words (such as “a,” “the,” and “and”) should
never be counted at all. Remove them.

❼ As a result of lemmatization, stemming, and stopword elimination, a term
list may end up having duplicates (for example, “the chicks” and “a chick”
may both become “chick”). Convert each term list into a set. Surely, sets
have no duplicates.

❽ Your goal is to produce a term vector model (TVM): a table where rows
are terms and columns are community members.⁵ In the Pandas language,
this table is known as DataFrame, and its columns are known as Series.
Transform a term list to a Series, where the individual terms become the
Series index, and all values are set to 1, like this:

print(term_vectors)

< shiny! 1
+5 sexterity 1
big damn hero 1
«...more terms...»
Name: twentyplanes, dtype: float64

❾ Finally, join all Series into a DataFrame. This operation involves the mystery
of data alignment, whereby all participating Series are stretched vertically
to have their row labels (terms!) aligned. Such stretching almost inevitably
produces empty cells in the frame. Fill them up with zeros: an empty cell
in row A and column B signals that the term A was not on the list B. The
resulting DataFrame has 12,437 rows and adequately represents
the cultural domain of LiveJournal users interested in the CBS show The
Good Wife.
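The alignment step is easier to picture on a toy example. The sketch below imitates it without Pandas, using nested dictionaries: term_matrix is a hypothetical helper, and the member names and term sets are invented for illustration.

```python
def term_matrix(term_lists):
    """Align per-member term sets into {term: {member: 0 or 1}}."""
    # The union of all term sets plays the role of the aligned row index
    all_terms = sorted(set().union(*term_lists.values()))
    return {term: {member: int(term in terms)
                   for member, terms in term_lists.items()}
            for term in all_terms}

term_lists = {"alice": {"shiny!", "big damn hero"},
              "bob": {"shiny!", "good wife"}}
tvm = term_matrix(term_lists)
```

Every term gets a row, every member gets a column, and a zero marks a term that is absent from a member's list, which is what the outer join followed by fillna(0) achieves in the DataFrame version.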

Build the Term Network

The next CDA step requires that you build a network of terms: a graph where 
nodes represent terms, and (weighted) edges represent their similarities. 

You could have included all 12,437 discovered terms in the graph, but some
of them are mentioned only once or twice (which is expected, given Zipf’s
law⁶). Rather than wondering why the less frequently used words are in fact
less frequently used, remove all rows with fewer than ten occurrences, but
provide an option of changing the cut-off value MIN_SUPPORT in the future. At
this point, you might wish that Python had first-class constants, but
nonetheless spell MIN_SUPPORT in all capital letters. DataFrame limited is a
truncated version of domain: it has only 319 rows.

5. en.wikipedia.org/wiki/Vector_space_model

6. mathworld.wolfram.com/ZipfsLaw.html

Zipf’s Law

Zipf’s law states that given some corpus of terms drawn from a natural language, the
frequency of any word is inversely proportional to its rank by frequency. In other
words, if the frequency of the most frequent term is $f_0$, then the frequency of the next
most frequent term is $f_0/2^a$, and so on, and the frequency of the $n$th term is
$f_0/n^a$, where the exponent $a$ is close to 1. The same law (with a slightly different
exponent $a$) applies to population ranks of cities, corporation sizes, income rankings,
and more. The continuous form of the discrete Zipf law is known as the Pareto
distribution, and Zipf’s law is a special case of the power law (mentioned on page 106).
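As a quick numeric illustration of the law, the sketch below lists the expected frequencies of the top-ranked terms. The function name zipf_frequencies and the starting frequency of 1,200 are arbitrary choices made for this example.

```python
def zipf_frequencies(f0, how_many, a=1.0):
    """Expected frequencies of the terms ranked 1..how_many under Zipf's law."""
    return [f0 / rank ** a for rank in range(1, how_many + 1)]

freqs = zipf_frequencies(1200, 4)
print(freqs)
```

With the exponent equal to 1, the second term is half as frequent as the first, the third is a third as frequent, and so on, which explains why most terms in a corpus are mentioned only once or twice.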


lj.py

MIN_SUPPORT = 10
sums = domain.sum(axis=1)
limited = domain[sums >= MIN_SUPPORT]

Since you want to build a network based on co-occurrences, you can consider
two terms as similar if different community members frequently use them
together. Calculate the matrix of co-occurrence by matrix-multiplying the
limited DataFrame by its own transpose.

The resulting square DataFrame cooc contains the total counts of all terms on
the main diagonal (suppress them by multiplying the matrix element-wise by an
inverted identity matrix) and the counts of co-occurrences elsewhere (they will
eventually become weighted network edges).

lj.py

cooc = limited.dot(limited.T) * (1 - np.eye(limited.shape[0])) 
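To see why the product of the term matrix and its transpose counts co-occurrences, consider this dependency-free sketch. The helper cooccurrence and the three toy terms are illustrative assumptions; unlike the DataFrame version, it leaves the diagonal pairs out altogether instead of zeroing them.

```python
def cooccurrence(tvm):
    """tvm maps term -> list of 0/1 flags (one per member).
    Return {(t1, t2): count} for t1 != t2, that is, the product of the
    matrix and its transpose with the main diagonal left out."""
    terms = sorted(tvm)
    return {(t1, t2): sum(x * y for x, y in zip(tvm[t1], tvm[t2]))
            for t1 in terms for t2 in terms if t1 != t2}

tvm = {"glee":  [1, 1, 0, 1],
       "lost":  [1, 1, 1, 0],
       "house": [0, 0, 1, 0]}
cooc = cooccurrence(tvm)
```

Each product x * y equals 1 only when the same member lists both terms, so the sum over members is exactly the number of shared interest lists.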

Slice the Network 

Now, you must make another painful decision: which matrix elements become
edges and which get discarded? Slice Weighted Networks, on page 79, explains
the slicing philosophy. Choose the slicing threshold, SLICING, to be six. Higher
SLICING results in many small communities. Lower SLICING results in few large
communities. Six seems to be a good compromise between count and size.

The resulting matrix is very sparse (every non-empty cell represents an edge,
but we agreed to have as few edges as possible!). Stack and normalize it; in
essence, convert it into a sparse representation, where each row represents a
significant edge and its weight. Since NetworkX prefers to deal with Python
(rather than Pandas) data structures, convert the weights to a dictionary:

lj.py

SLICING = 6
weights = cooc[cooc >= SLICING]     # Cells below the threshold become NaN
weights = weights.stack()           # One row per surviving (term, term) pair
weights = weights / weights.max()   # Normalize to the range (0, 1]
cd_network = weights.to_dict()
cd_network = {key: float(value) for key, value in cd_network.items()}
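The slicing and normalization steps can be sketched on a plain dictionary of co-occurrence counts. The helper name slice_and_normalize and the counts are invented for this example; the threshold plays the role of SLICING.

```python
def slice_and_normalize(cooc, threshold):
    """Keep the pairs whose count reaches the threshold; scale weights to (0, 1]."""
    kept = {pair: count for pair, count in cooc.items() if count >= threshold}
    if not kept:
        return {}
    top = max(kept.values())
    return {pair: count / top for pair, count in kept.items()}

cooc = {("glee", "lost"): 12, ("glee", "house"): 6, ("lost", "house"): 3}
edges = slice_and_normalize(cooc, threshold=6)
```

Raising the threshold to 7 would leave a single edge, which hints at why higher thresholds fragment the network into many small communities.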

You are just one step away from having an amazingly structured network.
Let’s create a new empty graph, populate it with the edges from the dictionary,
and update the “weight” edge attributes:

lj.py

tag_network = nx.Graph()
tag_network.add_edges_from(cd_network)
# NetworkX 1.x argument order; NetworkX 2.0 and later expect (G, values, name)
nx.set_edge_attributes(tag_network, "weight", cd_network)

The constructed network with the added attributes is your first fascinating
result; save it without hesitation into a GraphML file in a specially created
directory results (Share and Preserve Networks, on page 29).

lj.py

if not os.path.isdir("results"):
    os.mkdir("results")

with open("results/" + DOMAIN_NAME + ".graphml", "wb") as ofile:
    nx.write_graphml(tag_network, ofile)

If you had no access to non-Anaconda modules, you would abandon Python
at this point and switch to interactive software like Pajek,⁷ UCINET⁸ (which
are outside the scope of this book), or Gephi (Chapter 4, Introducing Gephi, on
page 31) for further network analysis. Fortunately, Python has the community
module (Outline Modularity-Based Communities, on page 136) that will let you
stay in the same program for the entire analysis cycle.

Extract and Name Term Communities 

The modularity of the new network is quite poor (we suggested on page 136
that a network is definitely modular only when the modularity is 0.6 or above):

lj.py

partition = community.best_partition(tag_network)
print("Modularity: {}".format(community.modularity(partition,
                                                   tag_network)))
# NetworkX 1.x argument order; NetworkX 2.0 and later expect (G, values, name)
nx.set_node_attributes(tag_network, "part", partition)

< Modularity: 0.15815567681142356


7. vlado.fmf.uni-lj.si/pub/networks/pajek/

8. sites.google.com/site/ucinetsoftware/home

Apparently, counting raw co-occurrences is not the best way to describe
similarities. Indeed, correlation-based networks are much more flexible, and
you will learn about them later (Chapter 14, Similarity-Based Networks, on
page 163). However, even the coarse network that you have allows some
meaningful interpretation.
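For readers who want to see what a modularity score measures, here is a minimal sketch for the unweighted case. It is not the community module's implementation (which also handles edge weights); the function and the toy graph of two triangles joined by one edge are assumptions made for illustration.

```python
from collections import defaultdict

def modularity(edges, partition):
    """Newman modularity of an unweighted, undirected graph without self-loops.
    edges: iterable of (u, v) pairs; partition: {node: community_id}."""
    edges = list(edges)
    m = len(edges)
    inside = defaultdict(int)     # edges with both ends in the community
    degree = defaultdict(int)     # total degree of the community's nodes
    for u, v in edges:
        degree[partition[u]] += 1
        degree[partition[v]] += 1
        if partition[u] == partition[v]:
            inside[partition[u]] += 1
    # Observed share of internal edges minus the share expected at random
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single edge
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
partition = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
q = modularity(edges, partition)
```

For this toy graph the score is 5/14, or about 0.36: each triangle holds 3 of the 7 edges, while random wiring with the same degrees would give each community a quarter of them.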

The partition that you extracted precisely defines the term communities. (Add
them as an attribute part to the network nodes.) The following figure shows
the whole network of tags. The node diameter represents the number of times
the corresponding tag was mentioned in the corpus, and the node colors
match the term communities.

So, you extracted the communities, but they are nameless. You could have
explored them with your bare eyes and come up with some proper names,
but then you would not be a hardcore Python programmer after that, would
you? Instead, add a cherry on the cake and write another seven lines of code
that locate the five most frequently used terms per cluster. Hopefully, they
indeed describe the content.

You need a little helper function describe_cluster(terms_df). The function takes a
DataFrame of terms in one community, extracts the namesake rows from the
original domain, calculates their use frequency, and returns the top HOW_MANY
performers.

lj.py

HOW_MANY = 5
def describe_cluster(terms_df):
    # terms_df is a DataFrame; select the matching rows from "domain"
    rows = domain.join(terms_df, how="inner")
    # Calculate row sums, sort them in descending order, get the first HOW_MANY
    top_N = rows.sum(axis=1).sort_values(ascending=False)[:HOW_MANY]
    # What labels do they have?
    return top_N.index.values

Finally, convert the partition into a DataFrame, group the rows by their partition
ID, and ask the helper to come up with a name for each community.

lj.py

tag_clusters = pd.DataFrame({"part_id": pd.Series(partition)})
results = tag_clusters.groupby("part_id").apply(describe_cluster)
for r in results:
    print("-- {}".format("; ".join(r.tolist())))

Surprisingly, it works!

-- good wife; harry potter; game throne; misfit; skin
-- music; reading; movie; writing; book
-- bone; grey ' anatomy; gilmore girl; house; friend
-- battlestar galactica; west wing; x - file; hugh laurie; house md
-- lost; glee; fringe; vampire diary; 30 rock
-- doctor; veronica mar; firefly; supernatural; buffy vampire slayer

Each line shows up to five most frequently mentioned terms per cluster (the
terms are separated by semicolons). Some terms look strange and barely
recognizable (“x - file”), but remember all those rigorous transformations that
they had to go through, such as lemmatizing! Some terms are duplicates
(“house md” and “house”), but this simply means that the transformations were
not rigorous enough.
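The naming step boils down to grouping terms by cluster and ranking them by frequency. The following sketch shows the idea without Pandas; name_clusters is a hypothetical helper, and the frequencies are made-up numbers rather than values from the LiveJournal data.

```python
from collections import defaultdict

def name_clusters(partition, frequency, how_many=5):
    """partition: {term: cluster_id}; frequency: {term: count}.
    Return {cluster_id: the how_many most frequent terms, most frequent first}."""
    clusters = defaultdict(list)
    for term, cluster_id in partition.items():
        clusters[cluster_id].append(term)
    # Sort by descending frequency; break ties alphabetically
    return {cluster_id: sorted(terms, key=lambda t: (-frequency[t], t))[:how_many]
            for cluster_id, terms in clusters.items()}

partition = {"music": 0, "reading": 0, "book": 0, "lost": 1, "glee": 1}
frequency = {"music": 40, "reading": 35, "book": 20, "lost": 30, "glee": 31}
names = name_clusters(partition, frequency, how_many=2)
```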

Interpret the Results 

There are two levels of interpretation of the CDA results. 

At the lower level, you can conclude that all 319 terms selected for analysis
are important for The Good Wife viewers; otherwise, the viewers would not
have selected them. Moreover, some of the terms go together more often than
the others. You can see that “bone[s],” “grey ’[s] anatomy,” and “gilmore girl[s]”
have something to do with each other (hint: Sean Gunn starred in all three
series); “veronica mar[s]” and “firefly” are TV shows that were both prematurely
canceled; “music,” “reading,” and “writing” are popular activities... You can
make these mechanistic conclusions without having the slightest idea about
the cultural domain.

At the higher level, you might be an ethnographer, anthropologist, psychologist,
or sociologist interested in the mindscape of The Good Wife fans. In
particular, you might want to compare their mindscape to the mindscapes
of, say, Harry Potter or House, M.D. fans. However, since you are a humble
computer programmer or data scientist, you shall entrust the higher level
interpretation to the SMEs.

This case study guided you from selecting an online blogging (or, rather,
gossiping) community to constructing and partially interpreting a cultural
domain: a semantic network of terms partitioned into six term collections. The
method is fairly extensible and can be applied to other topical communities.

You have seen a network of words, but you have not seen it all. In the next
chapter, we will show you a network of cosmetics!

In Constantinople there are some persons, particularly Armenians,
who devote themselves to the preparation of cosmetics, and obtain
large sums of money from those desirous of learning this art.

G. W. Septimus Piesse, British perfumer


CHAPTER 13


Case Study: Going from
Products to Projects


One of the goals of product network analysis is to identify
nontrivial collections of co-purchased or co-recommended
products. We can treat such collections as “customer
projects” or “toolsets.” You can find these networks of products
frequently purchased together or recommended to be used together in
marketing, advertising, and similar business disciplines.


This chapter uses
Matplotlib, community.


As an example of product network analysis, let’s have a look at cosmetics
sold by Sephora®. In this case study, you will learn how to convert a CSV
data file with cosmetics co-purchasing data into a complex network with
the help of csv, itertools, and collections libraries. You will calculate attribute
assortativity of the complex network and blockmodel it, that is, construct its
higher-level representation as an induced graph. Finally, you will use
graphviz_layout() to produce a picture of the network without invoking any
non-Python software.


Read Data 


Most products on Sephora’s website have “Use With” recommendations: one
or more other products that the Sephora staff recommends customers purchase
in conjunction with the original product.¹ For each product, the website
contains plenty of characterizing information, such as brand, category, price,
volume, and star rating. Each product is uniquely identified by the store SKU
number and an alphanumeric ID. We will build a network of “Use With”
products created from previously acquired data and explore its structure
concerning the product categories.


1. www.sephora.com

Crawling Sephora

If you want to download information about a specific product, you can program a
crawling procedure using modules urllib.request for the actual download, and BeautifulSoup
(bs4) for parsing the HTML response. (Both modules are outside of the scope of this
book.) Sephora’s website has a straightforward organization. Once you extract the
IDs of the recommended products, you can repeat the download/parse cycle until
your script finds no more new products. Since the “Use With” network is disconnected,
you will need to run the crawling procedure more than once, starting from randomly
selected products, to harvest all or at least the largest components.
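The download/parse cycle described in the sidebar is a breadth-first traversal. The sketch below shows its skeleton only: the function crawl, the callback get_recommended, and the hard-coded FAKE_SITE catalog are all hypothetical stand-ins, and no real download or HTML parsing takes place.

```python
from collections import deque

def crawl(seeds, get_recommended):
    """Breadth-first harvest of "Use With" edges starting from seed product IDs.
    get_recommended(product_id) must return an iterable of product IDs."""
    seen = set(seeds)
    queue = deque(seeds)
    edges = []
    while queue:
        product = queue.popleft()
        for other in get_recommended(product):
            edges.append((product, other))
            if other not in seen:       # schedule each product only once
                seen.add(other)
                queue.append(other)
    return seen, edges

# A stand-in for the download/parse step: a small hard-coded catalog
FAKE_SITE = {"P1": ["P2", "P3"], "P2": ["P1"], "P3": [], "P9": ["P8"], "P8": []}
products, edges = crawl(["P1"], lambda pid: FAKE_SITE.get(pid, []))
```

Starting from P1, the crawl never reaches P8 or P9, which illustrates why a disconnected network requires several runs from different seeds.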

For your convenience, we provide the raw data for the network construction
in two CSV files. File use-with.csv has 3,943 rows: the first row is the header;
the remaining rows contain network edges as edge ID (not needed in this case
study), start product node, and end product node. We assume that the network
is undirected (in reality it is not). File product.csv has 2,976 rows: the first row
is the header; the remaining rows describe product nodes (one node per row)
and contain product attributes as product ID, brand, star rating, and category.
The latter two attributes may be empty.

As always, we start by importing all the necessary modules (the Aside titled
“Where to Import?” on page 102 explains why):

products.py 
import csv 

from collections import Counter
from operator import itemgetter 
from itertools import chain, groupby 
import networkx as nx 

from networkx.drawing.nx_agraph import graphviz_layout 
import community 
import matplotlib.pyplot as plt 
import dzcnapy_plotlib as dzcnapy 

Our next step is to read the edges and product attributes from the files,
construct a vanilla network, and decorate its nodes with attributes (Add
Attributes, on page 23):

products.py

with open("use-with.csv") as usewith_file:
    reader = csv.reader(usewith_file)
    next(reader) # ► Skip the header row
    G = nx.from_edgelist((n1, n2) for n1, n2 in reader)

with open("products.csv") as product_file:
    reader = csv.reader(product_file)
    next(reader) # ► Skip the header row
    brands = {}
    cats = {}
    star_ratings = {}
    for ppid, brand, star_rating, category in reader:
        brands[ppid] = brand
        cats[ppid] = category
        star_ratings[ppid] = float(star_rating if star_rating else 0) # ►

# Set node attributes, based on product attributes
attributes = {"brand": brands, "category": cats, "star": star_ratings}
for att_name, att_value in attributes.items():
    # NetworkX 1.x argument order; NetworkX 2.0 and later expect (G, values, name)
    nx.set_node_attributes(G, att_name, att_value)

Note how we use next(reader) to skip the header rows in the first two highlighted 
lines, and how we impute zero star rating for the rows that do not have the 
star rating field in the last highlighted line. 
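The two tricks mentioned above (skipping the header and imputing missing ratings) can be tried on an in-memory file. In this minimal sketch the column names and the two product rows are invented; only the four-column layout follows the description of the product file.

```python
import csv
import io

SAMPLE = io.StringIO("id,brand,star,category\n"
                     "P1,BrandA,4.5,Makeup:Face:Foundation\n"
                     "P2,BrandB,,\n")

reader = csv.reader(SAMPLE)
next(reader)                       # skip the header row
stars = {}
for ppid, brand, star_rating, category in reader:
    # A missing field is read as an empty string, which is falsy
    stars[ppid] = float(star_rating if star_rating else 0)
```

Note that a zero imputed this way is indistinguishable from a genuine zero-star rating, so it is worth keeping in mind when interpreting rating-based results.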

Analyze the Networks 

The resulting graph G has 2,975 nodes and 3,162 edges. It is very sparse:

print(nx.density(G))

< 0.0007147660678259198

It also has a lot of small connected components with two to four nodes, as
shown in the figure on page 156. (You can call nx.connected_components(G) and
measure the size and count of them on your own.)
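The density figure can be verified by hand: for an undirected graph without self-loops, it is the number of edges divided by the number of possible node pairs. The helper below is a small sketch of that formula, not a replacement for nx.density().

```python
def density(n_nodes, n_edges):
    """Share of possible undirected edges (no self-loops) that are present."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

d = density(2975, 3162)
print(d)
```

With 2,975 nodes there are about 4.4 million possible pairs, so 3,162 edges fill only about 0.07 percent of them.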

To keep the case simple, we consider only the largest component (the GCC).
We sort all components of G by size, select the last one (the largest!), join the
respective label lists into one with chain.from_iterable(), and extract the subgraph
induced by these nodes. We store the resulting subgraph in the variable
called gccs:

products.py

TOP_HOWMANY = 1
gccs_nodes = chain.from_iterable(sorted(nx.connected_components(G),
                                        key=len)[-TOP_HOWMANY:])
gccs = nx.subgraph(G, gccs_nodes)
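The selection of the largest component can also be sketched without NetworkX, which may help to see what nx.connected_components() computes. The function below and the six-node adjacency dictionary are illustrative assumptions.

```python
def connected_components(adjacency):
    """Yield the node sets of the connected components of an undirected graph.
    adjacency: {node: set of neighbors}."""
    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first exploration of everything reachable from start
        component, stack = {start}, [start]
        while stack:
            for neighbor in adjacency[stack.pop()]:
                if neighbor not in component:
                    component.add(neighbor)
                    stack.append(neighbor)
        seen |= component
        yield component

adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"},
             "x": {"y"}, "y": {"x"}, "z": set()}
by_size = sorted(connected_components(adjacency), key=len)
largest = by_size[-1]
```

Sorting by len in ascending order puts the largest component last, which is why the listing above slices from the end.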

The subgraph contains 25 percent of nodes and 36 percent of edges from the
original graph, and it also has the most interesting structural elements. If
you don’t fancy these numbers, simply change TOP_HOWMANY.

So, what can we say about the distribution of the graph attributes? Do
neighbors tend to be assortative or disassortative (Estimate Network Uniformity
Through Assortativity, on page 97)? Let’s find out:

products.py

for att_name in attributes:
    print("Assortativity by {}: {}"\
          .format(att_name,
                  nx.attribute_assortativity_coefficient(gccs, att_name)))

< Assortativity by category: 0.03577904976206569
Assortativity by brand: 0.8687551723142831
Assortativity by star: -0.0058012311220827645
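To make the coefficient less of a black box, here is a minimal sketch of attribute assortativity for a categorical attribute, based on the standard mixing-matrix formula. It is an illustration, not the NetworkX implementation, and it assumes at least two distinct attribute values among the edge endpoints (otherwise the denominator is zero).

```python
from collections import Counter

def attribute_assortativity(edges, attribute):
    """Newman's assortativity coefficient for a categorical node attribute.
    edges: list of undirected (u, v) pairs; attribute: {node: value}."""
    mixing = Counter()
    for u, v in edges:
        # Count every undirected edge in both directions
        mixing[attribute[u], attribute[v]] += 1
        mixing[attribute[v], attribute[u]] += 1
    total = sum(mixing.values())
    values = {x for pair in mixing for x in pair}
    # Share of edge ends that connect equal values
    trace = sum(mixing[x, x] for x in values) / total
    # For an undirected graph, row and column sums of the mixing matrix coincide
    share = {x: sum(mixing[x, y] for y in values) / total for x in values}
    expected = sum(s * s for s in share.values())
    return (trace - expected) / (1 - expected)

same = attribute_assortativity([(1, 2), (3, 4)], {1: "x", 2: "x", 3: "y", 4: "y"})
mixed = attribute_assortativity([(1, 2)], {1: "x", 2: "y"})
```

The first toy graph connects only like with like and scores 1; the second connects only unlike nodes and scores -1. Values near zero, such as those for category and star rating, mean that the attribute tells you almost nothing about a node's neighbors.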

The news is mixed. On the one hand, connected products are very likely sold
under the same brand, because cosmetic brands provide comprehensive
toolkits! On the other hand, connected products are not particularly likely to
belong to the same category (indeed, why would one buy two tools from the same
category together?). On
the third hand (yes, computer scientists can have as many hands as it takes
to describe the problem, as long as all hands, except for the first two, are
virtual), connected products have unrelated star ratings. The last result is
confusing, and you can leave the question open until you can afford to hire
a subject-matter expert.

Because of the poor node assortativity by category, we expect a weird mixture
of categories within any structural element, for example, within
modularity-defined communities. Let’s partition the network into communities
and see how they are connected and named.

products.py

part = community.best_partition(gccs)
print("Modularity: {}".format(community.modularity(part, gccs)))

< Modularity: 0.8241527716500038

The following statements create a list of lists of nodes in each community by
collecting the nodes with the same partition ID. Function itertools.groupby()
demands that the sequence is already sorted by the same key as would be
used for grouping. In our case, the key is the partition ID, the second element
of each tuple on the list returned by part.items(), thus itemgetter(1). We will need
the list later to auto-generate community labels.

products.py

groups = groupby(sorted(part.items(), key=itemgetter(1)), itemgetter(1))
community_labels = [list(map(itemgetter(0), group)) for group in groups]
subgraphs = [nx.subgraph(gccs, labels) for labels in community_labels]
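The sorting requirement of itertools.groupby() is easy to overlook, so here is a small demonstration of what happens with and without it. The four-node partition dictionary is invented for the example.

```python
from itertools import groupby
from operator import itemgetter

part = {"a": 1, "b": 0, "c": 1, "d": 0}

# Without sorting, equal keys that are not adjacent end up in separate groups
unsorted_groups = [(key, [node for node, _ in group])
                   for key, group in groupby(part.items(), key=itemgetter(1))]

# Sorting by the grouping key first brings equal keys together
pairs = sorted(part.items(), key=itemgetter(1))
sorted_groups = [(key, [node for node, _ in group])
                 for key, group in groupby(pairs, key=itemgetter(1))]
```

The unsorted version produces four one-node groups, while the sorted version yields the two communities you would expect.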

We could use the previously constructed list of lists as a partition in
nx.blockmodel(), but instead, we will utilize community.induced_graph(partition, graph), the
blockmodeling tool from the community library:

products.py

induced = community.induced_graph(part, gccs)
# ► NetworkX 1.x method; recent NetworkX releases use nx.selfloop_edges(induced)
induced.remove_edges_from(induced.selfloop_edges())

The induced graph usually has many self-loops because of copious connections
between the nodes in the original network that belong to the same community.
We remove the loops (on the highlighted line) to avoid clutter in the future
network printout.
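Conceptually, the induced graph replaces every community with a single node and counts the edges that run between communities. The sketch below illustrates this on a plain edge list; induced_edges is a hypothetical helper, not the community module's code, and it drops the would-be self-loops on the fly.

```python
from collections import Counter

def induced_edges(edges, partition):
    """Collapse nodes into communities; return {(c1, c2): edge count}, c1 < c2.
    Edges inside a community (would-be self-loops) are dropped."""
    weights = Counter()
    for u, v in edges:
        cu, cv = partition[u], partition[v]
        if cu != cv:
            weights[min(cu, cv), max(cu, cv)] += 1
    return dict(weights)

edges = [(1, 2), (2, 3), (3, 4), (4, 5), (2, 4)]
partition = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
blocks = induced_edges(edges, partition)
```

Two of the five edges stay inside their communities and disappear; the remaining three become two weighted edges between the three community nodes.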

Name the Components 

The new induced graph nicely reflects the macroscopic structure of the original
product network. It has only eighteen nodes and twenty-nine edges. The nodes
are nameless so far, and we need to give them names. Having no better source
of labels than the product categories, we select the most popular category
within each induced node as the node label. We need an auxiliary function
to obtain the name of the dominant category in a community. The Sephora
website reports category names as colon-separated hierarchical paths. To
save space in the future printout, we keep only the last path component:

products.py

def top_cat_label(community_subgraph):
    # nodes(data=True) yields (node, attribute dictionary) pairs
    items = [atts["category"] for _, atts
             in community_subgraph.nodes(data=True)]
    top_category = Counter(items).most_common(1)[0]
    top_label_path = top_category[0]
    return top_label_path.split(":")[-1]

Function collections.Counter(sequence) is an indispensable tool for counting
occurrences of unique items in a sequence. It returns a dictionary-style Counter
object with the method Counter.most_common(n) that reports a list of the n most
popular items as (label, count) tuples (we only need the item label).

There may be several communities with the same dominant category in the
network. If we blindly relabel them, their respective induced nodes will have
the same label, and NetworkX will combine them into one node. To avoid
unnecessary node merging, let’s append the community ID to each label. The
new labels look somewhat odd, but at least they are unique:

products.py

mapping = {comm_id: "{}/{}".format(top_cat_label(subgraph), comm_id)
           for comm_id, subgraph in enumerate(subgraphs)}
induced = nx.relabel_nodes(induced, mapping, copy=True)
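The labeling logic can be tried in isolation on plain lists of category paths. In this sketch, top_label is a simplified stand-in for top_cat_label() that takes a list of paths instead of a subgraph, and the category paths themselves are invented.

```python
from collections import Counter

def top_label(categories):
    """Last component of the most common colon-separated category path."""
    path, _count = Counter(categories).most_common(1)[0]
    return path.split(":")[-1]

communities = [["Makeup:Face:Foundation", "Makeup:Face:Foundation", "Tools:Brushes"],
               ["Makeup:Face:Foundation", "Makeup:Eye:Mascara", "Makeup:Face:Foundation"]]
# Append the community ID so that equal dominant categories stay distinct
labels = {comm_id: "{}/{}".format(top_label(cats), comm_id)
          for comm_id, cats in enumerate(communities)}
```

Both toy communities are dominated by the same category, yet the appended IDs keep their labels apart, so relabeling would not merge the two nodes.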

At this point, our analysis is complete, but the data sponsor (the person or
organization who commissioned the study) would rather see a nice picture than
read a thousand barely decipherable labels. It’s time to produce a picture.
Function graphviz_layout() (Harness Graphviz, on page 28) attempts to find
appropriate positions for the graph nodes, and nx.draw_networkx() draws the
graph. The last function takes tons of parameters: you can customize edge
and node colors, sizes, labels, and so on. You can save the resulting picture
into a file, or display it on the screen, or both.

products.py

attrs = {"edge_color": "gray", "font_size": 12, "font_weight": "bold",
         "node_size": 700, "node_color": "pink", "width": 2,
         "font_family": "Liberation Sans Narrow"}

# Calculate best node positions
pos = graphviz_layout(induced)

# Draw the network
nx.draw_networkx(induced, pos, **dzcnapy.attrs)

# Adjust the extents
dzcnapy.set_extent(pos, plt)

# Save and show
dzcnapy.plot("ProductNetwork")

The output of the script is in the following figure. Since graphviz_layout() uses
random numbers to calculate the best network layout, you will probably see
a different picture when you execute the same code.

To understand the figure better, let’s explore the degrees of the induced nodes
with nx.degree(induced). The results are shown in the following table.

Node(s)                                                         Degree
Foundation/14, Face Brushes/3                                   8
Makeup & Travel Cases/8                                         7
Blush/9                                                         6
Eyeliner/13                                                     5
Makeup Palettes/11                                              4
Eye Palettes/10, Eyebrow/16                                     3
Contour/17, Face Brushes/0, Foundation/15, Foundation/4         2
Moisturizers/7, Highlighter/5, Lipstick/1, Mascara/2,
Mascara/6, Nail Care/12                                         1
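A table like this one can be assembled by grouping node labels by degree. The following is a minimal sketch on a made-up five-node network that stands in for the induced network (the edges are assumptions, not the real product data).

```python
from collections import defaultdict

import networkx as nx

# Hypothetical stand-in for the induced network: a small star with a tail.
G = nx.Graph([("Foundation/14", "Blush/9"), ("Foundation/14", "Eyeliner/13"),
              ("Foundation/14", "Mascara/2"), ("Blush/9", "Lipstick/1")])

# dict() accepts both the dictionary returned by older NetworkX releases
# and the degree view returned by newer ones.
degrees = dict(nx.degree(G))

# Group the node labels by degree, highest degree first.
by_degree = defaultdict(list)
for node, degree in degrees.items():
    by_degree[degree].append(node)
for degree in sorted(by_degree, reverse=True):
    print(degree, ", ".join(sorted(by_degree[degree])))
```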


At the top of the table, you can see the cosmetics essentials that are required
for makeup but are not visible, namely, foundations and tools (brushes,
cases, palettes). The most eye-catching tools are at the bottom of the table:
mascaras, lipsticks, nail care tools, and highlighters. We can hypothesize
that if a node is connected to (recommended to be “used-with”) fewer
neighbors, it is more specialized. The specialized nodes are at the periphery of the
product network and depend on the more general nodes in the core.
Just like some other case studies presented in the book, the “products to
projects” case is not limited to the Sephora products. Given sufficient
co-purchasing data, you can build a network of products, identify dense
product communities, name them, and argue about possible reasons for their
existence.

In the Next Part 

The complex networks you have seen so far had a reasonably crisp structure. 
For any two nodes, you could say, with a fair degree of confidence, whether 
there was an edge incident to them or not. That is not how things work in 
real life. 

In real life, there is almost always a degree of uncertainty involved in binary
relationships. Alice, Bob, and Chuck may be friends, but Alice and Bob may
be better friends than Alice and Chuck. A husky and a reindeer may be less
likely to be bought together than a husky and a penguin. The uncertainty is
a fact, and we need to know how to deal with it. In the next part, you will
learn how to connect nodes in a fuzzy way based on potential similarity.
Part IV

Unleashing Similarity 


Complex networks rarely have a crisp structure,
whereby two nodes are unquestionably connected
or not. The nodes are not always homogeneous,
either. In this part, you will learn different ways of
quantifying potential similarity between nodes and
analyzing networks consisting of more than one
class of nodes.




With these I have found nothing identical in any of the various books
of Emblems which I have examined; indeed, I cannot say that I have
met with anything similar.

Henry Green, English author 


CHAPTER 14


Similarity-Based Networks 


This chapter uses Pandas, NumPy, SciPy.


Complex networks rarely have a crisp structure that shows whether two nodes
are unquestionably connected. The nodes are not always homogeneous, either.
However, sometimes two items are similar, and as a complex network analyst,
you need a toolset for quantifying their similarity and converting similarities
into network edges.


In this chapter, you will learn (or refresh your knowledge of) several similarity 
measures: Hamming distance, Manhattan distance, Pearson correlation, 
cosine distance, and generalized similarity. You will familiarize yourself with 
several real-world networks based on similarity. 


Understand Similarity 

Similarity-based networks emerge from the similarity of one or more attributes
of objects represented by the network nodes. The type of objects and the
number of attributes are limited only by the creativity of the network
researchers. (This is not to say that your limitless imagination, rather than
your experience, should guide your research.) The nodes may represent people
(age, gender, language, skin color), products (price, color, shape, material),
companies (industry, size, country, the form of ownership), and so on. It is
your job to choose the “right” definition of similarity that at least does not
contradict common sense.


Any quantitative measure of similarity has two aspects: what to measure and
how to measure. In the case of similarity-based networks, the first aspect
addresses the choice of significant node attributes, and the second aspect
refers to transforming the attributes into distances.
Let’s start with the first issue by looking at two very different similarity-based
networks (an event network and a food network) and their node attributes
(or the lack of them).

Creating New Attributes 

When it comes to similarity-based networks, the most frustrating situation
is when you want to use node attributes to calculate similarity, but the
nodes have no attributes at all. The situation is not entirely desperate. Let’s
have a look at a dataset that furnishes no attributes, and build them from
the ground up.

An event network is a similarity-based network where nodes represent formal
or informal social events (book club meetings, political rallies, academic
conferences, rock concerts, and the like) and edges depict their similarities.
The nodes in an event network are usually not hard to identify, but the
similarities may be subtle. Let’s find out how to build an event network in five
ways. (Incidentally, this chapter covers five types of similarity, dismisses one,
and makes a promise to cover one more later.)

In the 1930s, five American ethnographers (Allison Davis and his colleagues)
assembled a dataset of eighteen women in Natchez, Mississippi, who attended
fourteen social events over a nine-month period. Once published (Deep South
[DGG41]), the dataset eventually became the foundation of a “canonical”
“Southern Women” network. In fact, it is so respected in social network analysis
that NetworkX has a special function nx.davis_southern_women_graph() for generating it.
The figure on page 165 shows the network chart.

Suppose your goal is to transform the network of women and events into a
network of events. Such transformations are called projections. You will learn
more about projections in Project Bipartite Networks, on page 178; for now, we will
approach them informally.

Each event node in the original network is also a node in the event network.
To calculate similarity, you must select relevant node attributes. Formally,
the nodes have no attributes, aside from their arbitrarily assigned numerical
labels. So, why not treat the identities of the women who attended an event
as that event’s attributes?

After you obtain the synthetic graph of the Southern women and the
attended events G1, you should separate the nodes into the “women” and
“events” subsets. In general, this operation may be hard, but in the Davis
network, all event labels start with the capital letter E, followed by one or
more decimal digits. If a node matches the regular expression “E\d+”, it
represents an event. All network neighbors of an event node (for example,
E1) must be “women nodes,” whose labels are accessible as the keys of the
dictionary of neighbors. You can create a Pandas DataFrame that has only one
column of ones, named after the event (E1) and indexed by the names of
the attending women (in the case of E1: Brenda Rogers, Laura Mandeville,
and Evelyn Jefferson).

southern_women.py 

G1 = nx.davis_southern_women_graph()

# G1.adj maps each node to the dictionary of its neighbors
attendees = [pd.DataFrame({event: 1}, index=list(women.keys()))
             for event, women in G1.adj.items() if re.match(r"E\d+", event)]
att_mtx = pd.concat(attendees, axis=1).fillna(0).astype(int)
print(att_mtx)

When you concatenate all fourteen DataFrames, Pandas performs the index
alignment: each DataFrame is expanded as needed to place all data items with
the same index in the same row. If the expansion results in gaps, the gaps
are filled with a NaN, the NumPy designator for missing data. Replace the NaNs
with zeros to complement the ones, and you’ve got a binary attribute set for
each event node.
<                 E9  E7  E1  E2  E10  E5  E3  E12  E13  E11  E6  E14  E8  E4
Brenda Rogers      0   1   1   0    0   1   1    0    0    0   1    0   1   1
Charlotte Mc...    0   1   0   0    0   1   1    0    0    0   0    0   0   1
Dorothy Murc...    1   0   0   0    0   0   0    0    0    0   0    0   1   0
Eleanor Nye        0   1   0   0    0   1   0    0    0    0   1    0   1   0
Evelyn Jeffe...    1   0   1   1    0   1   1    0    0    0   1    0   1   1
Flora Price        1   0   0   0    0   0   0    0    0    1   0    0   0   0
Frances Ande...    0   0   0   0    0   1   1    0    0    0   1    0   1   0
Helen Lloyd        0   1   0   0    1   0   0    1    0    1   0    0   1   0
Katherina Ro...    1   0   0   0    1   0   0    1    1    0   0    1   1   0
Laura Mandev...    0   1   1   1    0   1   1    0    0    0   1    0   1   0
Myra Liddel        1   0   0   0    1   0   0    1    0    0   0    0   1   0
Nora Fayette       1   1   0   0    1   0   0    1    1    1   1    1   0   0
Olivia Carleton    1   0   0   0    0   0   0    0    0    1   0    0   0   0
Pearl Ogleth...    1   0   0   0    0   0   0    0    0    0   1    0   1   0
Ruth DeSand        1   1   0   0    0   1   0    0    0    0   0    0   1   0
Sylvia Avondale    1   1   0   0    1   0   0    1    1    0   0    1   1   0
Theresa Ande...    1   1   0   1    0   1   1    0    0    0   1    0   1   1
Verne Sanderson    1   1   0   0    0   0   0    1    0    0   0    0   1   0


Each column in the resulting matrix represents an event, each row accounts
for a woman, and the number at an intersection is one of the eighteen event
attributes: it indicates the presence (one) or absence (zero) of the woman at
the event. You will learn in Choose the Right Distance, on page 167, how to
connect the nodes to construct a similarity network.
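The index alignment that produces the attendance matrix is easier to follow on a toy case. The sketch below uses two made-up events and three made-up attendees (all names are assumptions) and repeats the same concatenate-then-fill steps.

```python
import pandas as pd

# Hypothetical attendance lists for two made-up events.
e1 = pd.DataFrame({"E1": 1}, index=["Ann", "Bea"])
e2 = pd.DataFrame({"E2": 1}, index=["Bea", "Cid"])

# Column-wise concatenation aligns the rows on the index labels; a woman
# who missed an event gets a NaN in that event's column.
aligned = pd.concat([e1, e2], axis=1)

# Replace the gaps with zeros to obtain binary attributes.
binary = aligned.fillna(0).astype(int)
print(binary)
```

Only Bea attended both events, so her row is the only one without a gap before the fill.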


Binarizing Existing Attributes 

Dealing with node items that do not have attributes is tough. Items that do
have attributes are much easier to handle. Just choose the attributes that
are essential for your future network (you may want to use all available
attributes or eliminate the insignificant features).

Do you remember the network of products in your pantry (Explore Your Pantry,
on page 120)? You can use the data from the same USDA website to create a
different kind of network.¹ For most food items, USDA provides the amounts
of energy (in kcal), proteins, lipids, carbohydrates, fibers, and sugars (as well
as minerals and vitamins) per serving. Each of the values is a potential
continuous attribute of a future network node.

1. ndb.nal.usda.gov/ndb/search

If the similarity measure you plan to use works only with binary attributes,
you must first dichotomize (binarize) the attributes that don’t fit. (In the case
of the pantry project, all the attributes are in the wrong form.) Dichotomization
can be accomplished either in plain Python or Pandas by comparing the value
of an attribute with the mean or median value of the same attribute. Let’s
say the list protein contains the amounts of proteins in each food item. Then
protein_bin is the dichotomized list:

import statistics

threshold = statistics.mean(protein)

# The choice of median ensures a balanced split!
# threshold = statistics.median(protein)
protein_bin = [p >= threshold for p in protein]

The Pandas solution involves converting the list to a Series unless the original 
data were in a Pandas format: 

protein_ser = pd.Series(protein)

protein_bin = protein_ser >= protein_ser.mean()

# protein_bin = protein_ser >= protein_ser.median()
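The choice between the mean and the median matters most when the data are skewed. Here is a small sketch with made-up protein amounts (the numbers are assumptions) in which a single outlier drags the mean upward.

```python
import statistics

# Hypothetical protein amounts (grams per serving) with one outlier.
protein = [1.0, 2.0, 3.0, 4.0, 40.0]

by_mean = [p >= statistics.mean(protein) for p in protein]
by_median = [p >= statistics.median(protein) for p in protein]

print(sum(by_mean), sum(by_median))
```

With the mean as the threshold, only the outlier is marked True; with the median, the split is close to even.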

Whether you used original node attributes, dichotomized them, or inferred
them from structural or other data, the next step in constructing a
similarity-based network is to calculate distances between the nodes and convert them
into weighted edges.

Choose the Right Distance 

All similarity measures are numeric (usually on the scale from -1 to 1 or 0 to 
1), so you must quantify any qualitative attributes before calculating simi- 
larities. Once quantified, the attributes can be thought of as coordinates of 
the object in a multidimensional coordinate space, where the number of 
dimensions equals the number of attributes. You can treat an object as a 
point in space, whose position is defined by the attributes. The similarity 
between two objects and the distance between the points representing the 
objects are complementary; the higher the distance, the smaller the similarity 
and vice versa. 

Let’s now have a look at some typical distance and similarity measures.

Hamming Distance 

Let’s suppose we want to build a network of objects that may have some
binary features. A feature is either present or not, and if it is present, then
the magnitude of the feature (if applicable) does not matter. The Hamming
distance between two objects is the number of features that are present in
one object, but not in the other, divided by the maximal number of features.
The similarity, conversely, is the number of features jointly present or absent
in both objects (again, divided by the maximal number of features). If two
objects have an identical set of features, they are more similar than two objects
whose features differ.
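The definition translates into a few lines of plain Python. This is a minimal sketch with two made-up binary feature vectors; the helper name hamming_distance is my own and does not come from any library.

```python
def hamming_distance(u, v):
    """Fraction of positions at which two equally long feature lists differ."""
    if len(u) != len(v):
        raise ValueError("feature lists must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)

# Hypothetical binary feature vectors of two objects (1 = feature present).
obj_a = [1, 0, 1, 1, 0]
obj_b = [1, 1, 0, 1, 0]

distance = hamming_distance(obj_a, obj_b)
similarity = 1 - distance
print(distance, similarity)
```

The two objects disagree on two features out of five, so the distance is 2/5 and the similarity is 3/5.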
The Hamming distance definition can be extended to include categorical
attributes. In this case, the distance between two objects is the number of
attributes whose values differ between the objects, divided by the total number
of attributes. The attributes do not even have to be quantitative.

Consider several vegetables with three attributes: shape, color, and
starchiness. Some attributes are binary (for example, starchy) and some are
qualitative (color).


Vegetable  Shape  Color      Starchy
Carrot     Conic  Orange     False
Corn       Conic  Yellowish  True
Potato     Round  Yellowish  True
Turnip     Round  Yellowish  False


Note that some attributes are binary and some are qualitative. Potatoes and
turnips have two equal attributes (shape and color). The distance between
them is 1/3, the similarity 2/3. Turnips and carrots are 1/3 similar,
potatoes and carrots are not similar at all (zero similarity), and so on. The
following code fragment calculates Hamming similarity (complementary to
the Hamming distance). The dataset for it has the same structure as G.node:


data = {
    "carrot": {"shape": "conic", "color": "orange", "starchy": False},
    "corn": {"shape": "conic", "color": "yellowish", "starchy": True},
    "potato": {"shape": "round", "color": "yellowish", "starchy": True},
    "turnip": {"shape": "round", "color": "yellowish", "starchy": False}
}

# Collect attribute names for all vegetables
atts = set.union(*[set(x.keys()) for x in data.values()])

# Assume that each node has each attribute, but the values may differ
sim_2dlist = [[sum(data[v1][att] == data[v2][att] for att in atts)
               / len(atts) for v1 in data]
              for v2 in data]


< [[1.0, 0.3333333333333333, 0.0, 0.3333333333333333],
   [0.3333333333333333, 1.0, 0.6666666666666666, 0.3333333333333333],
   [0.0, 0.6666666666666666, 1.0, 0.6666666666666666],
   [0.3333333333333333, 0.3333333333333333, 0.6666666666666666, 1.0]]


The similarity between an item and itself is 1. 

SciPy offers another way to calculate the Hamming distance: by calling the
namesake function hamming() from the module scipy.spatial.distance. The benefit
of using hamming() is the abstraction that the function creates: neither you nor
the reader of your code needs to understand exactly how the Hamming distance
is calculated to benefit from it.

import scipy.spatial.distance as dist

sim_2dlist = [[1 - dist.hamming(list(data[v1].values()),
                                list(data[v2].values())) for v1 in data]
              for v2 in data]

Note that we subtract the distance from one to convert it to similarity. Inci- 
dentally, the module has another four dozen functions for similarity measure- 
ments, some of which you will see later. 

The printout of sim_2dlist looks quite horrible and almost useless. Convert it 
to a NumPy array to add some order and enable vectorized operations: 

sim_array = np.array(sim_2dlist)

< array([[ 1.        ,  0.33333333,  0.        ,  0.33333333],
         [ 0.33333333,  1.        ,  0.66666667,  0.33333333],
         [ 0.        ,  0.66666667,  1.        ,  0.66666667],
         [ 0.33333333,  0.33333333,  0.66666667,  1.        ]])

You still won’t remember which column and row represent which vegetable,
unless you convert the array to a Pandas DataFrame and supply human-readable
labels:

sim_dataframe = pd.DataFrame(sim_array, columns=data, index=data)

< carrot corn potato turnip 

carrot 1.000000 0.333333 0.000000 0.333333 

corn 0.333333 1.000000 0.666667 0.333333 

potato 0.000000 0.666667 1.000000 0.666667 

turnip 0.333333 0.333333 0.666667 1.000000 

You can use this data to construct a weighted network and slice it if necessary,
as explained in Slice Weighted Networks, on page 79.
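One possible way to turn the vegetable similarities into a weighted network is sketched below. The 0.5 threshold and the edge-filtering approach are assumptions made for illustration; they are not necessarily the slicing technique described on page 79.

```python
import networkx as nx

labels = ["carrot", "corn", "potato", "turnip"]
# Hamming similarities of the four vegetables, entered as thirds.
sim = [[1.0, 1/3, 0.0, 1/3],
       [1/3, 1.0, 2/3, 1/3],
       [0.0, 2/3, 1.0, 2/3],
       [1/3, 1/3, 2/3, 1.0]]

# Connect every pair of distinct vegetables with a non-zero similarity.
G = nx.Graph()
G.add_nodes_from(labels)
for i, u in enumerate(labels):
    for j in range(i + 1, len(labels)):
        if sim[i][j] > 0:
            G.add_edge(u, labels[j], weight=sim[i][j])

# Slice the network: keep only the edges at or above a chosen threshold.
THRESHOLD = 0.5  # arbitrary cut-off, chosen for illustration
strong = nx.Graph()
strong.add_nodes_from(G)
strong.add_edges_from((u, v, d) for u, v, d in G.edges(data=True)
                      if d["weight"] >= THRESHOLD)
print(sorted(strong.edges()))
```

The self-similarities on the diagonal are skipped on purpose: they would only add self-loops.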

The Hamming distance/similarity works best when future network nodes
have many almost equally significant binary attributes whose
presence/absence is roughly equally probable, such as the network of event
nodes in Creating New Attributes, on page 164.

Manhattan Distance 

If your objects have non-binary or non-categorical attributes, the Hamming
distance is not applicable. The Manhattan distance is an extension of the
Hamming distance for continuous attributes. It is defined mathematically as
$d = |\Delta x_1| + |\Delta x_2| + \ldots + |\Delta x_N|$, where $d$ is the distance between two nodes A and
B, and $\Delta x_i = x_{Ai} - x_{Bi}$ is the difference between the values of their $i$th attribute.
In a human language, this means that the Manhattan distance between two
objects is the sum of the absolute differences of their attributes. For example, the
distance between Pennsylvania station ($x_{A1}$ = 7th Avenue, $x_{A2}$ = 33rd Street) and
the Metropolitan Opera ($x_{B1}$ = Columbus Avenue, $x_{B2}$ = 64th Street) in Manhattan,
New York, is |7-9| + |33-64| = 2 + 31 = 33 blocks. (If you’re not familiar with Manhattan,
Columbus Avenue is another name for 9th Avenue. In either case, you also
know now why the measure is called “Manhattan.”)
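The same arithmetic as a minimal plain-Python sketch; the helper name manhattan is my own, and the coordinates are the avenue and street numbers from the example.

```python
def manhattan(a, b):
    """Sum of the absolute differences of the corresponding attributes."""
    return sum(abs(x - y) for x, y in zip(a, b))

# (avenue, street) coordinates taken from the example above
penn_station = (7, 33)
met_opera = (9, 64)   # Columbus Avenue counted as 9th Avenue

print(manhattan(penn_station, met_opera))  # 2 + 31 = 33
```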

To observe the Manhattan distance in action, let’s have a look at the first five
humans’ heights and weights from the SOCR Data Dinov 020108
HeightsWeights dataset.²

hwdata = [[65.78, 112.99], 

[71.52, 136.49], 

[69.40, 153.03], 

[68.22, 142.34], 

[67.79, 144.30]] 

SciPy provides function dist.cityblock(u,v) (because Manhattan is not the only city
with blocks!) that takes two attribute vectors u and v and returns the
Manhattan distance between them.

hw_array = np.array(hwdata)

five_ppl = np.array([[dist.cityblock(x, y) for x in hw_array]
                     for y in hw_array])

< array([[ 0. , 29.24, 43.66, 31.79, 33.32], 

[ 29.24, 0. , 18.66, 9.15, 11.54], 

[ 43.66, 18.66, 0. , 11.87, 10.34], 

[ 31.79, 9.15, 11.87, 0. , 2.39], 

[ 33.32, 11.54, 10.34, 2.39, 0. ]]) 

You can define similarity as 1/five_ppl, max_distance-five_ppl, or in any other
complementary or reciprocal way.
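Here is a hedged sketch of two such conversions, applied to the distances between the first three people. The exact formulas are assumptions: the complement is scaled by the largest distance, and the reciprocal adds 1 to the denominator because a plain reciprocal would divide by the zeros on the diagonal.

```python
import numpy as np

# Manhattan distances between the first three people from the printout above.
dist_mtx = np.array([[0.00, 29.24, 43.66],
                     [29.24, 0.00, 18.66],
                     [43.66, 18.66, 0.00]])

# Complementary similarity: the farthest pair gets 0, identical items get 1.
sim_complement = 1 - dist_mtx / dist_mtx.max()

# Reciprocal similarity; adding 1 avoids dividing by the zero diagonal.
sim_reciprocal = 1 / (1 + dist_mtx)

print(sim_complement.round(3))
print(sim_reciprocal.round(3))
```

Both versions rank the pairs the same way: the smaller the distance, the larger the similarity.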

One major problem with the function dist.cityblock() is that it assumes that all
attributes are comparable in range. This assumption does not hold in general,
and it does not hold in the case of our five people in particular. Comparing
weight to height is worse than comparing the proverbial apples to oranges!
A workaround is to normalize each attribute by subtracting the smallest value
and dividing by the range:

hw_range = hw_array.max(axis=0) - hw_array.min(axis=0) 
hw_norm = (hw_array - hw_array.min(axis=0)) / hw_range 


2. wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights 
< array([[ 0.        ,  0.        ],
         [ 1.        ,  0.58691309],
         [ 0.63066202,  1.        ],
         [ 0.42508711,  0.73301698],
         [ 0.35017422,  0.78196803]])


five_ppl_norm = np.array([[dist.cityblock(x, y) for x in hw_norm]
                          for y in hw_norm])


< array([[ 0.        ,  1.58691309,  1.63066202,  1.15810409,  1.13214225],
         [ 1.58691309,  0.        ,  0.78242489,  0.72101679,  0.84488073],
         [ 1.63066202,  0.78242489,  0.        ,  0.47255793,  0.49851977],
         [ 1.15810409,  0.72101679,  0.47255793,  0.        ,  0.12386394],
         [ 1.13214225,  0.84488073,  0.49851977,  0.12386394,  0.        ]])


After normalization, the smallest value of each attribute is always 0 and
the largest is always 1. The Manhattan distance between two normalized
attribute vectors takes equal care of each attribute and is guaranteed to not
be greater than N, the number of attributes. Treat the numbers in the array as
edge weights, and you are one step away from a similarity network of persons
based on their weight and height.
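The whole normalization pipeline can be double-checked with NumPy alone. In this sketch, broadcasting stands in for dist.cityblock() (my substitution, not the method used in the listings). Each normalized attribute difference is at most 1, so the sum cannot exceed the number of attributes.

```python
import numpy as np

hw_array = np.array([[65.78, 112.99], [71.52, 136.49], [69.40, 153.03],
                     [68.22, 142.34], [67.79, 144.30]])

# Min-max normalization, one column (attribute) at a time.
col_min = hw_array.min(axis=0)
hw_norm = (hw_array - col_min) / (hw_array.max(axis=0) - col_min)

# Pairwise Manhattan distances via broadcasting:
# (5, 1, 2) - (1, 5, 2) -> (5, 5, 2), then sum over the attribute axis.
manhattan = np.abs(hw_norm[:, None, :] - hw_norm[None, :, :]).sum(axis=2)

n_attributes = hw_norm.shape[1]
print(manhattan.max() <= n_attributes)
```

The normalization breaks down if an attribute has the same value for every object, because its range is then zero.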


Euclidean Distance 

In case you thought you did not know what the Euclidean distance was, here
is a hint: it is the famous, though heavily tailored, Pythagoras’ Trousers
$d^2 = \Delta x_1^2 + \Delta x_2^2 + \ldots + \Delta x_N^2$. The Euclidean distance is not particularly useful in
CNA (except, perhaps, in geospatial networks) and is mentioned here simply
because it is probably the most well-known distance measure.
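For comparison with the Manhattan distance, here is a minimal plain-Python sketch on two made-up points; both helper names are my own.

```python
import math

def euclidean(a, b):
    """Square root of the sum of the squared attribute differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two hypothetical points that differ by 3 and 4 along the two axes.
p, q = (0, 0), (3, 4)
print(euclidean(p, q), manhattan(p, q))  # 5.0 7
```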


Cosine Distance 

The previous three distance/similarity measures treat nodes with N attributes
as points in an N-dimensional space. Sometimes, it makes sense to treat
attributes as directions (“lengthless” vectors). Consider a wind rose, a polar
plot showing typical wind speed and direction distributions for a particular
location. You can think of a wind rose as a set of sixteen floating point
attributes representing the average wind speed in each direction with the
22.5° angular step (N, NNE, NE, ENE, E, and so on). The figure on page 172
shows the wind roses for four cornerstone American cities: Anchorage, Boston,
Chicago, and San Francisco. The radius is the number of hours a year that
the wind blows at a speed from 7 to 12 mph from that direction.³ So, who is
the real Windy City?


3. www.meteoblue.com/en/weather/forecast/modelclimate/ 



In general, you can calculate the similarity between two geographical locations
as some inverse distance between the attribute vectors. However, when it
comes to winds, it may be more important to know where they blow from
rather than how strong they are. In other words, the angle (angular distance)
between the attribute vectors may be more useful than the linear distance.

The cosine distance is a measure of angular distance. It is defined as a
complementary cosine of the angle between two attribute vectors $x_A$ and $x_B$:
$d = 1 - x_A \cdot x_B / (\|x_A\| \times \|x_B\|) = 1 - \cos(x_A, x_B)$. The cosine similarity is the cosine itself.
Naturally, it ranges from 1 (the angle is 0°, the vectors are parallel) through
0 (90°, the vectors are orthogonal and independent) to -1 (180°, the vectors
are antiparallel).
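The three landmark values are easy to reproduce. The sketch below computes the cosine similarity directly with NumPy for a made-up two-dimensional vector and its parallel, orthogonal, and antiparallel companions; the helper name cosine_similarity is my own.

```python
import numpy as np

def cosine_similarity(a, b):
    """Scalar product divided by the product of the Euclidean lengths."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v = [1, 2]
print(cosine_similarity(v, [2, 4]))    # parallel
print(cosine_similarity(v, [-2, 1]))   # orthogonal
print(cosine_similarity(v, [-1, -2]))  # antiparallel
```

The measure is undefined for an all-zero vector, whose Euclidean length is zero.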

The cosine distance emphasizes the similarity of shapes, spikes, and other
patterns, rather than actual values. You can calculate it directly from the
equation mentioned previously. (Remember that $\|x_A\|$ is the Euclidean length
of $x_A$, and $x_A \cdot x_B$ is the scalar product of the two vectors.) Call the SciPy function
cosine(u,v) to simplify your code and make it more abstract. The following code
measures the cosine similarity (thus the 1-...) between the four cities, and it
uses Pandas without further ado:
winds = {
    "Anchorage": (58,60,132,552,291,180,88,62,58,36,20,4,3,16,119,81),
    "Boston": (93,104,106,101,80,82,82,110,216,292,281,205,246,204,159,86),
    "Chicago": (115,195,122,109,86,120,157,210,273,196,139,101,113,106,
                107,115),
    "San Francisco": (35,67,156,616,1208,894,268,67,2,0,0,0,2,9,22,35)
}

wind_cities_cosine = pd.DataFrame({y: [1 - dist.cosine(winds[x], winds[y])
                                       for x in winds] for y in winds},
                                  index=winds.keys())


               Anchorage    Boston   Chicago  San Francisco
Boston          0.408479  1.000000  0.884222       0.264567
Chicago         0.523189  0.884222  1.000000       0.381017
Anchorage       1.000000  0.408479  0.523189       0.791712
San Francisco   0.791712  0.264567  0.381017       1.000000


According to the printout, the two pairs of the most similar cities are Anchorage
and San Francisco (in the West) and Boston and Chicago (in the East).


Pearson Correlation 

One of the complications with the cosine similarity is that it is not invariant
to shifts: it fails to detect small variations of attributes. Using the wind rose
example, if you add another thousand hours a year to each direction in each
considered city, the cosine similarities of each pair of cities will approach 1.0,
making them all look the same. In other words, the cosine similarity formula
overestimates similarity, which is not necessarily desirable.

Another angular similarity measure is the Pearson correlation, which some 
of you may know from statistics. It often goes by the name “correlation” 
without the reference to Karl Pearson. It is not affected by shifts. 

The Pearson correlation is calculated using the same formula as for the cosine
similarity, except that the attribute vectors are first translated by subtracting
the mean $m(x)$: $s = (x_A - m(x_A)) \cdot (x_B - m(x_B)) / (\|x_A - m(x_A)\| \times \|x_B - m(x_B)\|)$.
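The effect of a shift on both measures can be verified numerically. The following sketch implements the Pearson correlation as the cosine similarity of mean-centered vectors (my own helper functions, not the SciPy ones) and applies a thousand-hour shift to the Boston and San Francisco wind roses.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson(a, b):
    """Cosine similarity of the mean-centered vectors."""
    return cosine_similarity(a - a.mean(), b - b.mean())

# Wind roses of Boston and San Francisco from the winds dictionary.
boston = np.array([93, 104, 106, 101, 80, 82, 82, 110, 216, 292, 281, 205,
                   246, 204, 159, 86], dtype=float)
san_francisco = np.array([35, 67, 156, 616, 1208, 894, 268, 67, 2, 0, 0, 0,
                          2, 9, 22, 35], dtype=float)

SHIFT = 1000.0  # extra hours a year added to every direction
for label, a, b in [("original", boston, san_francisco),
                    ("shifted", boston + SHIFT, san_francisco + SHIFT)]:
    print(label, round(cosine_similarity(a, b), 3), round(pearson(a, b), 3))
```

The cosine similarity of the shifted vectors jumps toward 1, while the Pearson correlation stays where it was, because the subtracted means absorb the shift.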

SciPy provides the function stats.pearsonr(), which calculates both the correlation
and its p-value as a tuple. If you’re not sure what the p-value is, think of it
as the measure of credibility of the reported correlation. If the p-value is less
than 0.01, the correlation can be trusted. If the p-value is above 0.01, you
should not take the correlation seriously even if it is high. The following code
calculates the correlation-based similarities for the “wind cities.”

from scipy import stats

wind_cities_pearson = pd.DataFrame({y: [stats.pearsonr(winds[x],
                                                       winds[y])[0]
                                        for x in winds] for y in winds},
                                   index=winds.keys())
               Anchorage    Boston   Chicago  San Francisco
Boston         -0.482339  1.000000  0.234174      -0.524232
Chicago        -0.288015  0.234174  1.000000      -0.352705
Anchorage       1.000000 -0.482339 -0.288015       0.704106
San Francisco   0.704106 -0.524232 -0.352705       1.000000


An even more efficient solution is to convert the attribute dictionary winds into
a Pandas DataFrame directly and then use the built-in method .corr(). The results
are numerically the same, but the rows and columns are nicely sorted by the
city names.

pd.DataFrame(winds).corr()


               Anchorage    Boston   Chicago  San Francisco
Anchorage       1.000000 -0.482339 -0.288015       0.704106
Boston         -0.482339  1.000000  0.234174      -0.524232
Chicago        -0.288015  0.234174  1.000000      -0.352705
San Francisco   0.704106 -0.524232 -0.352705       1.000000


You’ll see an efficient application of Pearson correlation similarity in Chapter
16, Case Study: Building a Network of Trauma Types, on page 185.


Generalized Similarity 

The generalized similarity is a powerful recursive technique for measuring
node similarities in bipartite networks. You need to learn more about bipartite
networks to appreciate it. Let’s postpone its introduction until Compute
Generalized Similarity, on page 181.

In this chapter, you learned that if two items are not explicitly connected and
don’t happen to be at the same place at the same time, you can still connect
them with a network edge if they are sufficiently similar. You learned about
different types of similarity and how to calculate similarity based on node
attributes. Very often, similarity-based networks are derived from bipartite
networks, that is, the networks that can be separated into two subnetworks in such
a way that no two nodes in the same subnetwork are adjacent. You’ll meet
the bipartite networks in the next chapter.
Friendship is a single soul dwelling in two bodies.
Aristotle, Greek philosopher


CHAPTER 15


Harnessing Bipartite Networks 


Bipartite networks, also known as two-mode networks, are a mechanism for
representing relationships between items that belong to two classes
or parts (such as students and professors or airlines and airports). Many
networks previously seen in this book are bipartite.

In this chapter, you will learn how to check if a network is bipartite, assign 
the nodes to the respective parts, and convert weighted or unweighted 
bipartite networks into weighted one-part networks (the latter operation is 
called “projection”). 



You Don't Have to Be Bipartite—Even If You Can 
Being bipartite is both a topological property of a network and the 
way the part attributes are assigned to the nodes. Some networks 
—such as a ring with an odd number of nodes—cannot possibly 
be treated as bipartite. No matter how you label the nodes, there 
will always be two adjacent nodes that belong to the same part. 
Conversely, a ring with an even number of nodes can be bipartite 
—but only if odd and even nodes belong to different parts. 
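The ring argument can be checked with a few lines of plain Python. The sketch below is a generic two-coloring test based on breadth-first search; it is my own illustration, not the NetworkX implementation.

```python
from collections import deque

def is_bipartite(adjacency):
    """Try to color the nodes with two colors so that no edge joins equals."""
    color = {}
    for start in adjacency:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in color:
                    color[neighbor] = 1 - color[node]
                    queue.append(neighbor)
                elif color[neighbor] == color[node]:
                    return False
    return True

def ring(n):
    """Adjacency lists of a ring (cycle) with n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

print(is_bipartite(ring(5)), is_bipartite(ring(6)))
```

In the odd ring, the coloring inevitably runs into two adjacent nodes of the same color; in the even ring, the colors simply alternate.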


It may be hard to believe, but you already built your first bipartite network:
it was the network of food items and nutrients in Draw Your First Network
with Paper and Pencil, on page 6. The network has two namesake classes of
nodes: food items (beef, spinach, and so on) and nutrients (vitamin C,
magnesium, and so on). Each food item node is adjacent only to nutrient nodes,
and each nutrient node is adjacent only to food item nodes. An edge always
connects nodes that belong to different parts.
Work with Bipartite Networks Directly

Many networks you met in the book are bipartite. Before we look into the
NetworkX functions and tools for the bipartite networks, let’s first revisit them.


Examples of Bipartite Networks 

The following table contains a list of some of the bipartite networks mentioned
in this book so far.


Name                       Location                   Part 1                       Part 2
Introductory toy network   on page 6, on page 21      Food items                   Nutrients
“Panama papers”            on page 101                Entities and intermediaries  Officers
Products in your pantry    on page 120, on page 166   Food items                   Ingredients
LiveJournal                on page 141                Users                        Interest terms
Southern women             on page 164                Women                        Events


Except for the “Panama papers,” each network has nodes of two types. (You
can conditionally put the “Panama” entities and intermediaries in one part,
but the resulting network is still not strictly bipartite. It is tripartite, as
explained in the following sidebar.)

How About More Parts?

You have seen unipartite networks where any node can be connected to any node.
You have seen bipartite networks where nodes link only to nodes from the other part.
The concept of network parts can be readily extended to the case of k-partite networks
with k parts, as long as two adjacent nodes do not belong to the same part. An
example of a tripartite network is a network of LiveJournal users (part 1), communities
(part 2), and interest terms (part 3). A user belongs to a community (a 1↔2 type edge);
a user is interested in a term (a 1↔3 type edge); and a community declares a term as
an interest (a 2↔3 type edge).
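The defining condition of a k-partite network (no edge inside a part) is simple to test. Here is a minimal sketch on a made-up tripartite fragment; the node names and the helper is_valid_k_partition are assumptions for illustration.

```python
def is_valid_k_partition(edges, part):
    """True if no edge connects two nodes assigned to the same part."""
    return all(part[u] != part[v] for u, v in edges)

# Hypothetical tripartite fragment: users (1), communities (2), terms (3).
part = {"alice": 1, "bob": 1, "knitters": 2, "yarn": 3}
edges = [("alice", "knitters"),   # user belongs to a community
         ("bob", "yarn"),         # user is interested in a term
         ("knitters", "yarn")]    # community declares an interest

print(is_valid_k_partition(edges, part))
# A user-to-user edge would violate the tripartite structure.
print(is_valid_k_partition(edges + [("alice", "bob")], part))
```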

You can build bipartite networks naturally because quite often real-world
datasets already include nodes of more than one type. However, they are not
easy to analyze and interpret. For starters, even such a basic measure as
node degree may be of questionable use in a bipartite network. Indeed, does
a company node with one hundred adjacent employee nodes have the same
degree as an employee node with one hundred adjacent company nodes? In
the former case, we are talking about a typical medium-size business. In the
latter case, the employee is probably a freelancer trying to make their one
hundred clients happy. Different types of centralities, paths, cores, cliques,
modularity-based communities, and other measures and structural elements
of bipartite networks may be similarly incomparable and even meaningless.

Basic Bipartite Functions 

To check whether network G is bipartite, call the predicate function nx.is_bipartite(G).
If the function returns False, skip the rest of the chapter. Wait, don’t. We will
use the pickled network of foods and nutrients nutrients.pickle (created in Read
a Network from a CSV File, on page 21), which is known to be bipartite.¹

bipartite.py 

import pickle
from networkx.algorithms import bipartite

with open("nutrients.pickle", "rb") as infile:
    N = pickle.load(infile)
print(bipartite.is_bipartite(N))

< True 

Note that most of the bipartite functions come from the module nx.algorithms.bipartite,
which you must correctly import.

Function bipartite.sets() splits the nodes of a bipartite network into two parts (and
returns two node sets). The function does not look at the node attributes. The
separation it performs is based purely on the network topology. It is your
responsibility to recognize the meaning of each part. For example, you can check
which set contains the vitamin C. The same set must contain all other nutrients.

bipartite.py 

bip1, bip2 = bipartite.sets(N)
print("C" in bip1, "C" in bip2)

< False True 

You can use the following code fragment to initialize the two sets without
second-guessing which part contains which nodes:

bipartite.py

foods, nutrients = (bip2, bip1) if "C" in bip1 else (bip1, bip2)
print(foods, nutrients)

< {'Spinach', 'Beans', 'Poultry', 'Veg Oils', 'Green Leafy Vegs', 'Cheese',
  'Asparagus', 'Potatoes', 'Fatty Fish', 'Carrots', 'Beef', 'Liver',
  'Seeds', 'Mushrooms', 'Eggs', 'Broccoli', 'Wheat', 'Whole Grains',
  'Pumpkins', 'Tomatoes', 'Kidneys', 'Legumes', 'Yogurt', 'Milk', 'Nuts',
  'Shellfish'} {'Thiamin', 'Folates', 'B6', 'E', 'Mn', 'Se', 'B12', 'D',
  'A', 'Riboflavin', 'C', 'Zn', 'Cu', 'Niacin', 'Ca'}
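The same idiom can be tried on a toy network. This is a sketch under the assumption of a small, connected, hypothetical graph. As far as I know, recent NetworkX releases refuse to split a disconnected bipartite network unless you tell them which nodes belong to one part, so the example keeps the graph connected.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical connected toy network of foods and nutrients
G = nx.Graph([("Spinach", "C"), ("Spinach", "Ca"),
              ("Milk", "Ca"), ("Tomatoes", "C")])

part1, part2 = bipartite.sets(G)
# Use a node whose meaning you know ("C" is a nutrient) to label the parts
foods, nutrients = (part2, part1) if "C" in part1 else (part1, part2)
print(sorted(foods), sorted(nutrients))
```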


1. pragprog.com/titles/dzcnapy/source_code

It is possible to analyze two-mode networks directly (Networks: An Introduction
[New10], Exploratory Social Network Analysis with Pajek (Structural Analysis
in the Social Sciences) [NMB11], Analyzing Social Networks [BEJ13]), and
networkx.algorithms.bipartite provides a variety of functions for such analysis. The
functions are mostly direct counterparts of the unipartite namesake brethren:
bipartite.density(), bipartite.degrees() (cf. nx.degree()), bipartite.clustering(),
bipartite.closeness_centrality(), bipartite.degree_centrality(),
bipartite.betweenness_centrality(), and some generator functions, including
bipartite.random_graph(). It is also customary to convert bipartite networks into
unipartite networks by projecting on one of the constituent parts.


Project Bipartite Networks 

You can project a bipartite network two ways: by keeping 
the nodes of part 1 and removing the nodes of part 2, and 
the other way around. The nodes that survive the projection 
are called the “bottom” nodes; the nodes that are removed 
are known as the “top” nodes. 

The projection operation transforms the original bipartite graph G into an
induced graph F by projecting G onto the bottom nodes. Graph F contains only
the bottom nodes, and two bottom nodes in F are adjacent to each other if
and only if they are adjacent to the same top node in G.

The following figure shows a fragment of the bipartite network of foods and 
nutrients before (left) and after the projection. The nodes C and Ca represent 
nutrients; they are the top nodes. All other nodes represent foods; they are 
the bottom nodes. Note that all food nodes connected to the same nutrient 
node in the original network form a clique in the induced network—the clique 
of foods providing that nutrient. 
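The definition is short enough to spell out by hand. The following sketch uses plain Python and a hypothetical two-nutrient fragment (the food lists are assumptions, not the book's data). It is meant only to illustrate why every top node turns into a clique of bottom nodes.

```python
from itertools import combinations

# Hypothetical fragment: nutrient (top node) -> foods (bottom nodes) providing it
top_to_bottom = {
    "C": {"Spinach", "Tomatoes", "Potatoes"},
    "Ca": {"Spinach", "Milk"},
}

projected_edges = set()
for foods in top_to_bottom.values():
    # All bottom nodes that share a top node become pairwise adjacent (a clique)
    for a, b in combinations(sorted(foods), 2):
        projected_edges.add((a, b))

print(sorted(projected_edges))
```

Milk and Tomatoes stay disconnected because no nutrient in the fragment links them.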



This section uses SciPy, Pandas.

Function bipartite.projected_graph(G, nodeset) projects a bipartite graph G onto the
nodes nodeset (the nodes must exist in G and belong to the same part).

bipartite.py 

n_graph = bipartite.projected_graph(N, nutrients)
f_graph = bipartite.projected_graph(N, foods)

The resulting network is undirected, unweighted, and unipartite. (And, by
the way, it is a product network.) You can calculate degrees, centralities, and
path lengths; extract cliques, cores, and communities; and perform any other
complex network analysis of it. The network still contains some knowledge
of the connecting nutrients, but the knowledge is implicit. Just like a geometric
projection, a network projection is lossy and irreversible (one cannot
reconstruct the original bipartite network from one of its projections).

There may be more than one top node connecting a pair of bottom nodes in the
same network. You can assign weights to the induced edges to reflect the
connection strength by calling the function
bipartite.weighted_projected_graph(G, nodeset, ratio).
(The last parameter controls whether the weights are absolute or relative.)

bipartite.py 

fw_graph = bipartite.weighted_projected_graph(N, foods, True) 
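To see what the two weighting modes produce, here is a small sketch on a hypothetical network of three foods and three nutrients. The data is made up for illustration. The absolute weight counts the shared nutrients, and the relative weight divides that count by the number of nodes in the other part.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical toy network: foods on one side, nutrients on the other
B = nx.Graph([("Spinach", "C"), ("Spinach", "Ca"), ("Spinach", "A"),
              ("Milk", "Ca"), ("Milk", "A"), ("Tomatoes", "C")])
foods = {"Spinach", "Milk", "Tomatoes"}

absolute = bipartite.weighted_projected_graph(B, foods)
relative = bipartite.weighted_projected_graph(B, foods, ratio=True)

# Spinach and Milk share two of the three nutrients (Ca and A)
print(absolute["Spinach"]["Milk"]["weight"])
print(relative["Spinach"]["Milk"]["weight"])
```

Relative weights are convenient when you want to compare projections of networks whose top parts differ in size.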

The following figure shows the weighted induced network of food items
connected by the similarity in terms of provided nutrients. It is still undirected
and unipartite.

Edge width in the figure represents the weights, and the weights represent
the similarity between the nodes. (The more shared nutrients they had in the
original network, the higher the similarity.) The induced network is a
similarity-based network, and everything you read about such networks in Chapter 14,
Similarity-Based Networks, on page 163 applies to it, too. Moreover, the network
is based on Hamming Distance! You may wonder if you can use other distance
definitions. Yes, you can, but on your own, without much help from NetworkX.

As an exercise, let’s build a network based on the Pearson Correlation. Start
by calculating the bi-adjacency matrix. A bi-adjacency matrix is like an
Incidence Matrix, except that the matrix rows and columns represent the top and
bottom nodes, respectively. (You have to tell NetworkX which nodes are top and
which are bottom by passing the list of bottom nodes as the second parameter.)
For each pair of rows, compute the Pearson correlation and arrange the results
into a square Pandas DataFrame foods.

bipartite.py 

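import networkx as nx
import pandas as pd
from scipy import stats

# One row per food; f_graph supplies the row order
adj = bipartite.biadjacency_matrix(N, f_graph).toarray()
foods = pd.DataFrame([[stats.pearsonr(x, y)[0] for x in adj]
                      for y in adj], columns=f_graph, index=f_graph)

SLICING_THRESHOLD = 0.375
stacked = foods.stack()
# Keep only the pairs whose correlation reaches the threshold
edges = stacked[stacked >= SLICING_THRESHOLD].index.tolist()
f_pearson = nx.Graph(edges)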

The matrix contains the similarities between the bottom nodes with respect
to the connectivity to the top nodes. Some similarities are negative; at the
very least, you must not convert them into edges. In fact, let’s discard as
many potential edges as possible, as long as the network remains connected.
In this example, the slicing threshold of 0.375 was chosen by trial and error
(see details on page 80). Note that this value is still statistically low: one would
hardly consider 0.375 a significant correlation!
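The slicing step can be rehearsed on a tiny matrix before you run it on real data. The sketch below assumes a hypothetical 4×4 bi-adjacency matrix (rows are foods, columns are nutrients) and uses np.corrcoef() instead of scipy.stats.pearsonr(), which yields the same Pearson coefficients for all row pairs in one call.

```python
import numpy as np

foods = ["Spinach", "Milk", "Cheese", "Tomatoes"]
# Hypothetical data. Rows: foods; columns: nutrients (A, C, Ca, D)
adj = np.array([[1, 1, 1, 0],
                [1, 0, 1, 1],
                [0, 0, 1, 1],
                [1, 1, 0, 0]])
corr = np.corrcoef(adj)   # pairwise Pearson correlations of the rows

THRESHOLD = 0.375         # the value from the text; tune it for your data
edges = [(foods[i], foods[j])
         for i in range(len(foods)) for j in range(i + 1, len(foods))
         if corr[i, j] >= THRESHOLD]
print(edges)
```

Iterating over i < j only skips both the diagonal of ones and the mirrored duplicates, so the result contains no self-loops.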

After slicing, arrange the surviving edges into a network and plot it. (Call function
nx.from_pandas_dataframe(df, source, target) from Adjacency Matrix, the Pandas Way, on
page 73 if you want to create a weighted network.) The figure on page 181 shows
the correlation-based network of foods. Compared to the previous figure, the
network has the same nodes (and in the same locations), but fewer edges.

You can extend the proposed algorithm to project bipartite networks using 
Euclidean, cosine, and any other reasonably defined distance measure. All 
of them have a subtle problem: they assume that all the top nodes are inde- 
pendent, and adjacency to each of them is equally important. Sometimes this 
assumption is correct; sometimes it is not. Consider the nutrients from our
dataset. It includes vitamins B6 and B12, niacin (also a vitamin B), and
riboflavin (yet another kind of vitamin B). All projection algorithms considered
so far treat these four nutrient nodes separately. If a food item provides B6
but not B12 and another item provides B12 but not B6, they are not
considered similar. But they would be if you merged the four specific vitamin B
nodes into one umbrella node.

If you have a strong reason to believe that some top nodes are more similar
to each other than the others, you may want to compute the so-called
generalized similarity.


Compute Generalized Similarity 


Traditionally, two bottom nodes are considered similar if
they are adjacent to the same top node or to a set of same
top nodes, even though the sameness may be too strict a
requirement. Kovács [Kov10] proposed to weaken the definition
of similarity. The new measure, dubbed “generalized similarity,” treats
two bottom nodes as similar if they are adjacent to similar top nodes. But
who decides whether two top nodes are similar? It is the reflexive definition
of generalized similarity itself: two top nodes are similar if they are adjacent
to similar bottom nodes. In fact, the algorithm for calculating the generalized
similarity does not care whether a node is top or bottom. It splits a bipartite
network into two parts and reports the similarities for each node pair in each
class with respect to the nodes in the other class.

The generalized similarity is computed iteratively. The algorithm repeatedly
calculates the pairwise Pearson correlations of the nodes in each part of the
network, gradually transforming the original Euclidean coordinate system
into an affine coordinate system. (The angles between the affine coordinate
axes, in general, are not right.) The angles between the axes that represent
similar items become more acute; the angles between dissimilar items become
more obtuse. (Remember that originally all items are considered independent;
that’s why all angles were right.) The Pearson correlation calculated in the
new deformed coordinate system better reflects the similarities of the nodes
in each network part.

The process is repeated until the affine coordinate system stabilizes and stops
morphing. The iterations may take considerable time. You can put a cap
either on the maximal number of iterations or the minimal deformation
magnitude at each iteration. A perfect solution for a large network (1,000 or
more nodes) is usually infeasible, anyway.

Module generalized implements the Kovács algorithm. You can download the
module from GitHub² or the book’s website³ as generalized.py.

Module generalized provides only one function, generalized_similarity(G, min_eps=0.01,
max_iter=50). The function takes a bipartite network and up to two loop
termination hints and returns a tuple of four values: two unipartite, undirected,
weighted similarity networks; the attained precision; and the number of
completed iterations. Start the analysis by calling the function:

bipartite.py 

from generalized import generalized_similarity

bip1, bip2, eps, n_iter = generalized_similarity(N, min_eps=0.001,
                                                 max_iter=100)
foods, nutrients = (bip1, bip2) if "C" in bip2 else (bip2, bip1)
SLICING_THRESHOLD = 0.9
# Remove the "weak" edges whose weight is below the threshold
foods.remove_edges_from((n1, n2) for n1, n2, d in foods.edges(data=True)
                        if d['weight'] < SLICING_THRESHOLD)

The rest of the script identifies and slices the network of interest, then
truncates the “weak” edges. The figure on page 183 shows the network of food items
based on the generalized similarities.


2. github.com/dzinoviev/generalizedsimilarity 

3. pragprog.com/titles/dzcnapy/source_code

The most prominent difference between the latter network and the previous
two attempts is the redistribution of graph density from the “center” (OK, we
know that networks do not have centers!) to the “periphery,” to the extent
that Fatty Fish became an isolate. You can merge it back into the giant
component by playing with the SLICING_THRESHOLD at the expense of having less
structure in the other parts of the network if you want.

As a free byproduct of the projection, you got a network of the nutrients,
nutrients. The twin networks in the generalized similarity analysis problems
may have considerable size and waste the precious memory of your computer.
If you don’t plan to use them, tell Python: del nutrients.

Bipartite networks and networks that consist of more than two parts are much
more common in life than one may be inclined to think. Treating a network
as bipartite gives you additional CNA tools (various types of projections), adds
another dimension to your projects, and empowers you to discover unexpected
dependencies between the nodes.

In the next chapter, you will see how to build a network of something
seemingly totally unrelated to networks: psychological trauma types. Not only is
it cool, but it also helps to diagnose psychiatric disorders!

Sometimes, as she sat alone in the arm-chair in her room, she would
begin laughing and crying at the same time, with a sort of tearless
grief, or else relapse into convulsions, and scream out dreadful,
incoherent words in a horrible voice. It was the first dire sorrow which
she had known in her life, and it reduced her almost to distraction.

Leo Tolstoy, Russian writer

CHAPTER 16

Case Study: Building a Network of Trauma Types

A typical dataset for bipartite network construction consists
of objects and their properties, such that each object has
several properties and each property is found in several
objects. In this case study, objects are subjects of a mental
trauma study (suitably anonymized), and properties are their
trauma types. You will learn how to derive a network of trauma types (or other
properties) based on the subjects’ experience.

Embark on Psychological Trauma 

Exposure to traumatic events is quite common among children and adoles- 
cents. One notable challenge facing trauma researchers is understanding the 
nature and importance of the co-occurrence of exposure to different types of 
psychological trauma. You may wonder if complex network analysis is the 
right (or a right) tool for this task. 

CNA has been indeed widely used in medical and public health research in
the last decade (see, for example, work by Nicholas Christakis [CF09] and A.-L.
Barabási [BGL11], and the chapters on network epidemiology in Network
Science: Theory and Applications [Lew09] and Networks, Crowds, and Markets:
Reasoning about a Highly Connected World [EK10]). However, the focus of
those studies was mainly on social networks or gene networks. We can go
the extra mile and look at the network of diagnoses (the trauma types)
defined by their similarity with respect to exposure. Such a network, if
constructed, could be used to classify trauma types, which would hopefully
improve the quality of trauma diagnostics and treatment. In fact, the network
has been constructed, and it indeed substantially improved the quality of
diagnostics (Network Analysis of Exposure to Trauma and Adverse Events in
a Clinical Sample of Children and Adolescents [HSZL17]). You have all the
necessary skills not only to reproduce the process but also to look at several
possible trauma networks.

This chapter uses Pandas, NumPy, community, generalized.

The goal of this case study is not just to show you how to read CSV files or
construct similarity-based networks. After all, you have been reading about
that stuff for almost two hundred pages. You will see that, given the same
data, you can transform it into different networks and perhaps even come to
different conclusions.

Read the Data, Build a Bipartite Network

As always, the script starts with a fat chunk of import statements.

jri_code.py

import pandas as pd
import numpy as np
import networkx as nx
from networkx.algorithms.bipartite import sets, weighted_projected_graph
from networkx.drawing.nx_agraph import graphviz_layout
import scipy.spatial.distance as dist
from scipy.stats import pearsonr
import community
import generalized
import dzcnapy_plotlib as dzcnapy
import matplotlib.pyplot as plt

Boston’s Justice Resource Institute generously provided the dataset for this
project.¹ You can find it in the file jri_data.csv. The file is a correctly formatted
CSV table with standard delimiters and a header row at the top. Each of the
nineteen columns represents a trauma type.

jri_code.py 

matrix = pd.read_csv("jri_data.csv")
print(matrix.columns, matrix.shape)

< Index(['SEXUAL_ABUSE', 'SEXUAL_ASSAULT', 'PHYSICAL_ABUSE', 'PHYSICAL_ASSAULT',
  'PSYC_MALTX', 'NEGLECT', 'DOMESTIC_VIOLENCE', 'WAR', 'WAR_NOT_US',
  'MEDICAL_TRAUMA', 'INJURY_ACCIDENT', 'NATURAL_DISASTER', 'KIDNAP',
  'TRAUMATIC_LOSS', 'FORCED_DISPLACEMENT', 'IMPAIRED_CAREGIVER',
  'EXT_INTERPER_VIOLENCE', 'COMMUNITY_VIOLENCE', 'SCHOOL_VIOLENCE'],
  dtype='object') (618, 19)


1. jri.org

The trauma names are reasonably self-explanatory, except for PSYC_MALTX
(“psychological maltreatment”), WAR_NOT_US (“war outside the USA”), and
EXT_INTERPER_VIOLENCE (“extended interpersonal violence”).

Each row represents one patient (or subject, as they used to refer to
participants in social and behavior studies not so long ago). The original dataset
has been already anonymized to preserve the patients’ privacy. The JRI staff
replaced each patient’s name with a unique integer number. Since in this
study we do not care about the patients’ identity at all, the JRI identifiers
have been removed altogether. All we know is that a patient of an unknown
age and gender has been exposed to a set of psychological traumas at an
unknown time. Respectively, the values of the DataFrame matrix are zeros and
ones (in the floating-point format), depending on whether the patient in a row
was diagnosed with the trauma in a column or not.

Let’s build the network of the trauma types four different ways: from Hamming
similarity, cosine similarity, Pearson correlation, and generalized similarity.
At the moment, you know that each method evaluates similarity in its own way
and none of the four methods seems to have a clear advantage over the other
three. If you randomly commit yourself to one of the methods, you may end
up with an inaccurate, distorted, or even incorrect network. You will be able
to select the most efficient analysis tool by the end of this chapter.

The first and last networks are induced, so we need a bipartite network
patients_traumas of patients and trauma types first. We will construct the other
two networks directly from the matrix. The next code fragment prepares the
bipartite network and double checks if it is indeed bipartite. Note that the
matrix is the bi-adjacency matrix of the network of interest (explained on
page 180).

jri_code.py 

# Make a multi-index of patients+traumas
stacked = matrix.stack()

# Select the patients who _have_ traumas
edges = stacked[stacked > 0].index.tolist()
patients_traumas = nx.Graph(edges)
print(nx.is_bipartite(patients_traumas))

< True 
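If the stacking trick looks opaque, try it on a table small enough to read in full. The sketch below assumes a hypothetical three-patient, three-trauma table (the values are invented). It shows that each surviving index entry is a (patient, trauma) pair, that is, an edge.

```python
import pandas as pd
import networkx as nx

# Hypothetical miniature of the patient-by-trauma table
matrix = pd.DataFrame({"WAR": [1.0, 0.0, 0.0],
                       "NEGLECT": [1.0, 1.0, 0.0],
                       "KIDNAP": [0.0, 1.0, 1.0]})

stacked = matrix.stack()                      # Series indexed by (row, column)
edges = stacked[stacked > 0].index.tolist()   # keep only the ones
print(edges)

G = nx.Graph(edges)
print(nx.is_bipartite(G))
```

The row labels (integers) become the patient nodes and the column labels (strings) become the trauma nodes, so the two parts cannot clash by name.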

The bipartite network is just an intermediate milestone for this project. There
is no point in visualizing or analyzing it.

Build Four Weighted Networks

Now that you have all the preprocessed data (DataFrame matrix and network
patients_traumas), you can transform it into four weighted networks.

Each column in the matrix is a 618-dimensional vector of binary properties of
the future trauma node: the property of being diagnosed in patient 0; the
property of being diagnosed in patient 1; and so on. Surely, two trauma types
are similar if the vectors are similar in some sense. Once the similarities of
each pair of vectors are known, the process of network construction is
straightforward and can be implemented as a set of functions, at least for
the cosine and Pearson distances.

jri_code.py 

def similarity_mtx(biadj_mtx, similarity_f):
    """
    Convert a bi-adjacency matrix to a similarity matrix,
    based on the distance measure
    """
    similarity = [[similarity_f(biadj_mtx[x], biadj_mtx[y])
                   for x in biadj_mtx] for y in biadj_mtx]
    # Discard the main diagonal of ones
    similarity_nodiag = similarity * (1 - np.eye(biadj_mtx.shape[1]))
    similarity_df = pd.DataFrame(similarity_nodiag,
                                 index=biadj_mtx.columns,
                                 columns=biadj_mtx.columns)
    return similarity_df

The function similarity_mtx(biadj_mtx, similarity_f) takes the bi-adjacency matrix and
a similarity measure (a two-argument function that returns the similarity of
its parameters) and returns the similarity matrix. The matrix always has ones
on the main diagonal because each node is similar to itself. The function
removes the main diagonal, which otherwise would result in a bunch of
self-loop edges.

jri_code.py 

def similarity_net(sim_mtx, threshold=None, density=None):
    """
    Convert a similarity matrix to a sliced similarity network
    """
    stacked = sim_mtx.stack()
    if threshold is not None:
        stacked = stacked[stacked >= threshold]
    else:
        count = int(sim_mtx.shape[0] * (sim_mtx.shape[0] - 1) * density)
        stacked = stacked.sort_values(ascending=False)[:count]
    edges = stacked.reset_index()
    edges.columns = "source", "target", "weight"
    # from_pandas_dataframe() is the NetworkX 1.x name of this function;
    # newer NetworkX releases call it from_pandas_edgelist()
    network = nx.from_pandas_dataframe(edges, "source", "target",
                                       edge_attr=["weight"])
    # Some nodes may be isolated; they have no incident edges
    network.add_nodes_from(sim_mtx.columns)
    return network

DENSITY = 0.35

The function similarity_net(sim_mtx, threshold=None, density=None) slices the similarity
matrix and converts the surviving entries into the edges of the induced
network. Depending on the chosen similarity measure, the interpretation of the
edge weight differs. It is not fair to use the same slicing threshold for two
similarity matrices computed with different distances and expect them to be
comparable. That’s why the function performs slicing based either on the
slicing threshold (for the networks that are based on the same distance) or
desired network density. Considering that your four networks emerge from
four different distance measures, you must use the density-based mechanism.
The density of 0.35 seems to produce a nice collection of networks, but you
are encouraged to experiment with it.
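The density-based rule boils down to simple arithmetic: an undirected network with n nodes has at most n(n-1)/2 edges, so the desired density tells you how many of the strongest edges to keep. The following sketch illustrates the rule in plain Python. The function keep_strongest() and the weights are hypothetical and not part of the book's script.

```python
def keep_strongest(weights, n_nodes, density):
    """Return the strongest edges that give a network of about the given density.

    weights: dict {(node_a, node_b): weight}, one entry per undirected pair.
    """
    # Number of edges in a network with n_nodes nodes and the requested density
    count = int(n_nodes * (n_nodes - 1) / 2 * density)
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:count]

# Hypothetical similarities between four nodes
weights = {("A", "B"): 0.9, ("A", "C"): 0.2, ("A", "D"): 0.5,
           ("B", "C"): 0.7, ("B", "D"): 0.1, ("C", "D"): 0.4}
kept = keep_strongest(weights, n_nodes=4, density=0.5)
print(kept)
```

The function similarity_net() omits the division by two because a stacked symmetric similarity matrix lists every pair twice, once in each direction.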

Two trauma nodes are cosine similar if the angle between their vectors (Cosine
Distance, on page 171) is small and the cosine of the angle is large.

jri_code.py 

def cosine_sim(x, y):
    return 1 - dist.cosine(x, y)

cosine_mtx = similarity_mtx(matrix, cosine_sim)
cosine_network = similarity_net(cosine_mtx, density=DENSITY)
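For 0/1 vectors, the cosine has a convenient reading: the number of patients who share both diagnoses, divided by the geometric mean of the two diagnosis counts. The sketch below spells this out in plain Python for two hypothetical five-patient vectors. It is an illustration only, not a replacement for dist.cosine(), and it assumes that neither vector is all zeros.

```python
from math import sqrt

def cosine_sim_binary(x, y):
    """Cosine similarity of two equal-length 0/1 vectors (neither all zeros)."""
    shared = sum(a * b for a, b in zip(x, y))   # patients with both traumas
    return shared / sqrt(sum(x) * sum(y))

# Hypothetical diagnoses of five patients
war = [1, 1, 0, 0, 1]
neglect = [1, 0, 0, 1, 1]
similarity = cosine_sim_binary(war, neglect)
print(round(similarity, 3))
```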

Two trauma nodes are Pearson similar if their vectors are positively correlated
(Pearson Correlation, on page 173). An added benefit of Pearson correlation is
that it comes with a p-value. If you want to consider only statistically
significant correlations (say, with the p-value<0.01), you can modify the similarity
function appropriately.

jri_code.py 

def pearson_sim(x, y):
    return pearsonr(x, y)[0]

pearson_mtx = similarity_mtx(matrix, pearson_sim)
pearson_network = similarity_net(pearson_mtx, density=DENSITY)

# Shall we discard the statistically insignificant ties?
def pearson_sim_sign(x, y):
    r, pvalue = pearsonr(x, y)
    return r if pvalue < 0.01 else 0

pearson_mtx_sign = similarity_mtx(matrix, pearson_sim_sign)
pearson_network_sign = similarity_net(pearson_mtx_sign, density=DENSITY)

The Hamming and generalized similarity networks are weighted and complete
projections of the patients_traumas bipartite network. They already contain all
edges, and your job is to remove the “weak” edges. The function slice_projected(net,
threshold=None, density=None) is an equivalent of similarity_net(sim_mtx, threshold=None,
density=None) and removes the edges that have small weight or make the network
too dense.

jri_code.py 

def slice_projected(net, threshold=None, density=None):
    """
    Slice a projected similarity network by threshold or density
    """
    if threshold is not None:
        weak_edges = [(n1, n2) for n1, n2, w in net.edges(data=True)
                      if w["weight"] < threshold]
    else:
        count = int(len(net) * (len(net) - 1) / 2 * density)
        weak_edges = [(n1, n2) for n1, n2, w in
                      sorted(net.edges(data=True),
                             key=lambda x: x[2]["weight"],
                             reverse=True)[count:]]
    net.remove_edges_from(weak_edges)

Two trauma nodes are Hamming similar if the trauma types have been frequently
observed together in the same patients (Hamming Distance, on page 167).

jri_code.py 

net1, net2 = sets(patients_traumas)
_, traumas = (net1, net2) if "WAR" in net2 else (net2, net1)
hamming_network = weighted_projected_graph(patients_traumas,
                                           traumas, ratio=True)
slice_projected(hamming_network, density=DENSITY)

Two trauma nodes are generally similar if the trauma types have frequently
been observed in similar patients (Generalized Similarity, on page 174). This
piece of code incidentally also generates a similarity network of the patients,
which you do not need for this project.

jri_code.py 

net1, net2, eps, n = generalized.generalized_similarity(patients_traumas)
_, generalized_network = (net1, net2) if "WAR" in net2 else (net2, net1)
slice_projected(generalized_network, density=DENSITY)
generalized_network.remove_edges_from(generalized_network.selfloop_edges())

Congratulations! You aspired to compute a network of trauma types. Instead,
you got four (or five, if you count two versions of the Pearson network as
two networks). But which one is the best and which are on the chopping
block?

Plot and Compare the Networks

All four networks have the same number of nodes and similar density (and
a similar number of edges), which makes them very easy to compare. The
following picture shows the charts of all four networks.



[Figure: charts of the four trauma networks; surviving panel labels: cosine, hamming]



The difference between the tidy generalized similarity network, cleanly
separated into two components, and its brethren is striking. You can still see
some structure in the Pearson network, but the other two graphs are nearly
random.

The numerical experiment with community structure extraction (see Outline
Modularity-Based Communities, on page 136) confirms: the first network has
an acceptable modularity of 0.47 and four network communities. The other
networks are not very modular.

jri_code.py

networks = {
    "generalized": generalized_network,
    "pearson": pearson_network_sign,
    "cosine": cosine_network,
    "hamming": hamming_network,
}
partitions = [community.best_partition(x) for x in networks.values()]
statistics = sorted([
    (name,
     community.modularity(best_part, netw),
     len(set(best_part.values())),
     # list() keeps the count working when isolates() returns an iterator
     len(list(nx.isolates(netw)))
    ) for (name, netw), best_part in zip(networks.items(), partitions)],
    key=lambda x: x[1], reverse=True)
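To get a feeling for the modularity values, it helps to compute one by hand. The sketch below uses NetworkX's own modularity() function from networkx.algorithms.community (not the community module used in the script above) on a hypothetical network of two triangles joined by one edge.

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Two triangles joined by a single bridge edge (3, 4)
G = nx.Graph([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)])
partition = [{1, 2, 3}, {4, 5, 6}]

q = modularity(G, partition)
print(round(q, 3))
```

Each triangle holds 3 of the 7 edges and half of the total degree, so the modularity is 2 × (3/7 − 1/4) = 5/14, or about 0.36. A network sliced into random-looking pieces would score close to zero.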

The following table shows the modularity-related statistics of the four networks.

Similarity type   Number of isolates   Modularity   Number of communities
Generalized       0                    0.47         4
Pearson           0                    0.20         4
Cosine            4                    0.04         6
Hamming           6                    0.00         7


The generalized similarity network has no isolated nodes, the highest
modularity, and the smallest number of detected communities. Its community
structure partitions the trauma types into the smallest number of well-defined
groups of similar size. These groups are compact and homogeneous, and with
some insignificant effort can be labeled, as shown in the following table.

Group label             Trauma types
Personal violence       SEXUAL_ASSAULT, SEXUAL_ABUSE, KIDNAP, PSYC_MALTX, WAR,
                        PHYSICAL_ABUSE, PHYSICAL_ASSAULT
Medical traumas         WAR_NOT_US, TRAUMATIC_LOSS, INJURY_ACCIDENT, MEDICAL_TRAUMA
Societal traumas        SCHOOL_VIOLENCE, COMMUNITY_VIOLENCE, EXT_INTERPER_VIOLENCE
Neglect and relocation  IMPAIRED_CAREGIVER, FORCED_DISPLACEMENT, DOMESTIC_VIOLENCE,
                        NATURAL_DISASTER, NEGLECT

This impressive summary completes your analysis of the bipartite network
of patients diagnosed with psychological traumas. You started with tabular
clinical data pertaining to the trauma cases, and explored four ways of
converting the data to a weighted network. You compared the networks, selected
the one with the highest modularity, and identified four trauma clusters. (The
number of discovered clusters is unrelated to the number of weighted
networks.) The results of the study could be even more descriptive if you
considered the temporal sequences of traumatic events.

In the Next Part

Your common sense may have suggested that some traumas happen only in
a particular order. For example, SCHOOL_VIOLENCE does not happen until a child
goes to school. The constructed network of traumas cannot reflect the
sequencing because it is undirected. The best it can do is to mark two trauma
types as similar (if they are). You will work with directed networks in the next
part of the book.




Part V 

When Order Makes a Difference 


Even in the simplest social ego-network, the edges
often have directions: their start and end nodes
have different semantics. In this part, you will
familiarize yourself with directed networks, in particular
with directed acyclic graphs, and partitioning
them into equivalence classes.




You can't get there from here. 

Stereotypically attributed to people from Maine 


CHAPTER 17

Directed Networks 


Are you a robber or being robbed? Did the Yankees win over the Red Sox or lose
to them? Does fish oil provide omega acids or the other way around? Some
networks are inherently asymmetric, but we never talked about them until now.

In this chapter, you will learn how to identify asymmetric relationships
between items and build and handle directed networks. In the end, you will
be able to check if a directed network is a directed acyclic graph, and if it is,
establish a partial order of the nodes by performing a topological sort.

Discover Asymmetric Relationships

A directed network is a network that has at least one directed edge. Naturally,
a directed edge is an edge that has a direction: it connects node X to node Y,
but not the other way around. Mathematically, the relationship represented
by a directed edge is asymmetric.

Many real-world relationships are asymmetric and, ideally, must be modeled
as directed networks. Here are some examples of asymmetric or possibly
asymmetric relationships:

In social networks: 

• Friendship: Alice may believe she is a friend of Bob, but Bob may have
a different opinion. In the not-so-rare case when Alice and Bob are
mutual friends, you can either model their friendship as an undirected
edge or create two anti-parallel directed edges: one from Alice to Bob
and the other from Bob to Alice.

• Subordination: if Alice is a subordinate of Bob, then Bob is not a 
subordinate of Alice. 

• Some family relationships, such as parenthood: if Alice is a parent of
Bob, then Bob is not a parent of Alice.

In semantic networks:

• Being a hypernym (a more general word) or a hyponym (a more specific
word): color is a hypernym of red because red is always a color, but
the converse is not true.

In other networks:

• Membership: Alice can be a member of an organization, but the
organization cannot be a member of Alice.

• Sequencing: if A happens after B, then B does not happen after A.

• WWW links: a link from one web page to another does not imply a
reciprocal link.

• Flow: any flow from one node to another (including flows of goods, 
people, money, and information) is asymmetric and must be modeled 
as a directed edge. In particular, one-way streets in a transportation 
network are directed edges. 
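The friendship example from the list above can be sketched with nx.DiGraph. The names and ties here are hypothetical. A one-sided relationship is a single directed edge, and a mutual relationship is a pair of anti-parallel edges.

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("Alice", "Bob")     # Alice considers Bob a friend; Bob does not reciprocate
G.add_edge("Bob", "Carol")     # Bob and Carol are mutual friends:
G.add_edge("Carol", "Bob")     # two anti-parallel directed edges

print(G.has_edge("Alice", "Bob"), G.has_edge("Bob", "Alice"))
print(G.has_edge("Bob", "Carol"), G.has_edge("Carol", "Bob"))
```

In a directed graph, has_edge() is sensitive to the order of its arguments, which is exactly the asymmetry that the list describes.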

Even forward and backward references in a book establish an asymmetric
relationship between the book units (such as chapters). The figure shows the
network of chapters of this book. An edge in the figure connects a chapter to
another chapter if there is at least one reference in the former chapter to the
latter chapter.

Remember that NetworkX draws rectangles to represent edge arrows. If this
unusual notation is difficult to get used to, switch to Gephi for visualization.
A chapter whose node has more incident rectangles has more external
references and should be an earlier chapter in the book if the goal is to avoid the
forward references that some editors consider harmful.

Directed or Signed? 

You may feel that the asymmetry of directed networks has something in 
common with the asymmetry of signed networks (Signed Networks, on page 
60) and wonder if signed and directed networks model the same aspects. No, 
they do not. Signed edges (just like all other weighted edges) represent the 
intensity of the relationship. Directed edges represent its reciprocity. An edge 
can be (and often is) directed and weighted/signed at the same time. The 
following table shows how directedness and weight capture different aspects 
of a simple interpersonal relationship. 



             Signed (-)             Unsigned               Signed (+)

Directed     Alice hates Bob        Alice knows Bob        Alice likes Bob
             (but Bob does not      (but Bob does not      (but Bob does not
             hate Alice).           know Alice).           like Alice).

Undirected   Alice and Bob          Alice and Bob are      Alice and Bob
             are foes.              acquaintances.         are friends.


If you represented the same relationship with a “vanilla” unweighted, undirected 
edge, all the nuances of the Alice and Bob affinity would be irrecoverably lost. 
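
To see that sign and direction are independent, here is a minimal, dependency-free sketch. The dictionary signed_edges, the helper describe(), and the data are assumptions made up for illustration; they are not part of the book's code. Each directed edge is a key, and its sign is the value.

```python
# Each directed edge (source, target) carries a sign:
# -1 (negative), 0 (unsigned), or +1 (positive). The data are invented.
signed_edges = {
    ("Alice", "Bob"): +1,   # Alice likes Bob
    ("Bob", "Alice"): -1,   # Bob hates Alice
    ("Alice", "Chuck"): 0,  # Alice merely knows Chuck
}

def describe(u, v):
    """Describe the u->v edge: its sign and whether an edge v->u exists."""
    sign = signed_edges.get((u, v))
    if sign is None:
        return f"{u} has no edge to {v}"
    verb = {-1: "hates", 0: "knows", +1: "likes"}[sign]
    mutual = (v, u) in signed_edges
    return f"{u} {verb} {v} ({'reciprocated' if mutual else 'not reciprocated'})"

print(describe("Alice", "Bob"))
print(describe("Alice", "Chuck"))
```

The sign answers "how does Alice feel?" while the presence or absence of the reverse edge answers "does Bob respond at all?", so neither attribute can stand in for the other.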


Explore Directed Networks 

The directedness of edges dramatically affects almost all network measures and 
structural elements. Let’s have a look at some affected properties. As an example, 
let’s use a directed network of the top three preferred migration destinations for 
each state in 2015, constructed from the United States Census Bureau State- 
to-State Migration Flows dataset.¹ You can find the Python code for the network 
construction in the file migrations.py. The picture on page 200 shows the network 
sketch (the meaning of the colors will be explained later on page 202). 

Degree 

Each node in a directed graph G has three degrees: G.in_degree() (the number 
of incoming incident edges), G.out_degree() (the number of outgoing incident 
edges), and the total G.degree() (the number of all incident edges). Note that 
the total degree is the sum of the indegree and outdegree. 

1. www.census.gov/data/tables/time-series/demo/geographic-mobility/state-to-state-migration.html 

The indegree and total degrees of the migration graph designate the most 
attractive destinations (which are, not surprisingly, sunny California, Florida, 
and Texas). By construction, all nodes have the same outdegree of 3. 

sorted(G.in_degree().items(), key=lambda x: x[1], reverse=True)[:3] 
sorted(G.out_degree().items(), key=lambda x: x[1], reverse=True)[:3] 
sorted(G.degree().items(), key=lambda x: x[1], reverse=True)[:3] 

< [('CA', 21), ('FL', 17), ('TX', 16)] 
  [('KY', 3), ('MT', 3), ('MS', 3)] 
  [('CA', 24), ('FL', 20), ('TX', 19)] 
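
The .items() calls in the listing assume that the degree methods return dictionaries, as in the NetworkX version used in the book. NetworkX 2.0 returns view objects instead (see Appendix 2, NetworkX 2.0), so here is a hedged, version-neutral sketch that wraps the result in dict() first. The toy graph and the helper top() are assumptions made for illustration; they are not part of migrations.py.

```python
import networkx as nx

# A made-up miniature "migration" graph: an edge u -> v means
# that v is a preferred destination for the residents of u.
G = nx.DiGraph([("NV", "CA"), ("OR", "CA"), ("AZ", "CA"),
                ("CA", "TX"), ("NV", "TX"), ("OR", "WA")])

def top(degrees, n=3):
    """Return the n (node, degree) pairs with the largest degree."""
    # dict() accepts both a 1.x dictionary and a 2.x degree view.
    return sorted(dict(degrees).items(), key=lambda x: x[1], reverse=True)[:n]

print(top(G.in_degree()))   # most attractive destinations
print(top(G.out_degree()))  # most active sources
print(top(G.degree()))      # total degree = indegree + outdegree
```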

Neighbors 

A node in a directed graph has two types of neighbors: G.successors() (reachable 
through the outgoing edges) and G.predecessors() (reachable through the 
incoming edges). The method G.neighbors() is another name for G.successors(). In 
the migration network, the successors of the final destination (CA) are the 
preferred destinations of the outgoing migration. The predecessors are the states 
from which migrants come to California. 

final_destination = sorted(G.in_degree().items(), key=lambda x: x[1], 
                           reverse=True)[0][0] 
coming_from = G.predecessors(final_destination) 

< ['LA', 'TX', 'UT', 'ND', 'NY', 'NM', 'NV', 'VA', 'WA', 'ID', 'AK', 'MO', 
   'CO', 'OR', 'MT', 'HI', 'KS', 'AZ', 'IL', 'MA', 'OK'] 

going_to = G.successors(final_destination) 

< ['NY', 'TX', 'AZ'] 

Walks, Trails, and Paths 

A walk in a network is still any sequence of edges such that the end of one 
edge is the beginning of another edge (see Think in Terms of Paths, on page 
88). However, a directed edge has only one end and one beginning, while an 
undirected edge may begin at any of its incident nodes. Some undirected 
walks may become broken as a result of this additional restriction (think of 
encountering a one-way street going in the “wrong” direction). 
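
As a minimal sketch of this restriction, assuming the edges are stored as a plain set of (source, target) pairs, the two hypothetical helpers below check a sequence of nodes against the directed and the undirected reading of the same edges. Neither function is a NetworkX call.

```python
def is_directed_walk(edges, nodes):
    """True if every consecutive pair in `nodes` follows a directed edge."""
    return all((u, v) in edges for u, v in zip(nodes, nodes[1:]))

def is_undirected_walk(edges, nodes):
    """Same check, but each edge may be traversed in either direction."""
    return all((u, v) in edges or (v, u) in edges
               for u, v in zip(nodes, nodes[1:]))

# Hypothetical street map: A -> B is a one-way street; B <-> C is two-way.
streets = {("A", "B"), ("B", "C"), ("C", "B")}

print(is_directed_walk(streets, ["A", "B", "C", "B"]))  # True
print(is_directed_walk(streets, ["C", "B", "A"]))       # False: B -> A is the "wrong" way
print(is_undirected_walk(streets, ["C", "B", "A"]))     # True once directions are ignored
```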

Centralities and Other Distances 

Each node in a directed graph has three degree centralities (nx.in_degree_centrality(G), 
nx.out_degree_centrality(G), and nx.degree_centrality(G)), based on the namesake degrees. 
The other types of centralities (closeness, betweenness, and eigenvector) are 
calculated the same way for directed and undirected networks, but the results, 
in general, differ because of the different neighborhoods and paths. The latter 
is also true about the center, diameter, radius, eccentricity, and the periphery 
of a graph (Networks as Circles, on page 91). 

Components 

A directed network has two types of components, as explained in Split 
Networks into Connected Components, on page 126. In a strongly connected 
component, any member node is reachable from any other member node. 
(There is a migration flow from any state to any state, perhaps through some 
intermediate states in the same component.) In a weakly connected 
component, any member node would be reachable from any other member node 
if all edges were converted to undirected. (There is a migration flow either 
from or to any state, perhaps through some intermediate states in the same 
component.) 

sorted(nx.weakly_connected_components(G), key=len, reverse=True) # Only one! 
sorted(nx.strongly_connected_components(G), key=len, reverse=True) 

< [{'ND', 'MD', 'IL', 'CA', 'FL', 'NE', 'IN', 'NH', 'GA', 'ME', 'UT', 'PA', 
    'AZ', 'PR', 'VT', 'MT', 'NJ', 'MA', 'WV', 'AK', 'DC', 'MN', 'TX', 'AL', 
    'NM', 'MO', 'WI', 'WA', 'OR', 'LA', 'NV', 'IA', 'NC', 'MS', 'CO', 'WY', 
    'HI', 'SC', 'TN', 'CT', 'RI', 'DE', 'AR', 'OK', 'NY', 'KY', 'OH', 'KS', 
    'MI', 'ID', 'VA', 'SD'}] 
  [{'MD', 'IL', 'CA', 'FL', 'IN', 'GA', 'PA', 'UT', 'AZ', 'NJ', 'DC', 'TX', 
    'MO', 'WA', 'LA', 'OR', 'NC', 'MS', 'SC', 'TN', 'OH', 'NY', 'KY', 'KS', 
    'MI', 'ID', 'VA'}, {'NE', 'WI', 'ND', 'MN', 'IA'}, {'NH', 'ME'}, {'MA'}, 
   {'VT'}, {'PR'}, {'MT'}, {'AL'}, {'NV'}, {'CO'}, {'WY'}, {'CT'}, {'RI'}, 
   {'DE'}, {'OK'}, {'AR'}, {'SD'}, {'NM'}, {'WV'}, {'AK'}, {'HI'}] 
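
To make the two definitions concrete, here is a dependency-free sketch that finds strongly connected components with Kosaraju's algorithm. This is a swapped-in textbook technique, not necessarily what NetworkX does internally, and the function names and the four-state toy graph are assumptions made for illustration. Weak components fall out of the same routine once every edge is paired with its reverse.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Kosaraju's algorithm: return a list of sets of mutually reachable nodes."""
    succ, pred, nodes = defaultdict(set), defaultdict(set), set()
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
        nodes.update((u, v))

    def dfs(start, neighbors, seen, out):
        # Iterative depth-first search; appends nodes to `out` in postorder.
        stack = [(start, iter(neighbors[start]))]
        seen.add(start)
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(neighbors[nxt])))
                    break
            else:  # all neighbors of `node` are exhausted
                stack.pop()
                out.append(node)

    # Pass 1: record the finishing order on the original graph.
    order, seen = [], set()
    for n in sorted(nodes):
        if n not in seen:
            dfs(n, succ, seen, order)

    # Pass 2: explore the reversed graph in decreasing finishing order.
    components, seen = [], set()
    for n in reversed(order):
        if n not in seen:
            comp = []
            dfs(n, pred, seen, comp)
            components.append(set(comp))
    return components

def weakly_connected_components(edges):
    """Ignore directions: add the reverse of every edge, then reuse the routine."""
    both = list(edges) + [(v, u) for u, v in edges]
    return strongly_connected_components(both)

# Made-up flows: CA and TX exchange migrants; NV and MT only send them.
migration = [("CA", "TX"), ("TX", "CA"), ("NV", "CA"), ("MT", "NV")]
print(strongly_connected_components(migration))
print(weakly_connected_components(migration))
```

In the toy graph, only CA and TX can reach each other, so they form the only non-trivial strong component, while all four states belong to one weak component.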

Function nx.condensation(G) calculates the condensation of G. A condensation 
is an induced directed graph whose nodes represent strongly connected 
components of G, and edges represent bundles of the original edges, in the 
same spirit as explained on page 133. All original graph nodes within an 
induced node of the condensation are definitely reachable from each other. 
If your goal is to study graph reachability, replacing a strongly connected 
component with one node does not affect your findings, but makes the 
problem simpler. 

A strongly connected component is called attracting if it has no outgoing edges 
whatsoever. NetworkX offers functions nx.attracting_components(G) and 
nx.attracting_component_subgraphs(G) to obtain the attracting components. The 
figure on page 200 shows the nodes in the attracting component in green. Once 
you move into a “green” state, you will likely stay in the “green” state for good. 

sorted(nx.attracting_components(G), key=len, reverse=True) 

< [{'MD', 'IL', 'CA', 'FL', 'IN', 'GA', 'PA', 'UT', 'AZ', 'NJ', 'DC', 'TX', 
    'MO', 'WA', 'LA', 'OR', 'NC', 'MS', 'SC', 'TN', 'OH', 'NY', 'KY', 'KS', 
    'MI', 'ID', 'VA'}] 
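
Here is a small sketch of how the condensation and the attracting components relate, using a made-up four-state graph rather than the migration data. It relies on the "members" node attribute that nx.condensation() attaches to each induced node; the variable names are illustrative.

```python
import networkx as nx

G = nx.DiGraph([("CA", "TX"), ("TX", "CA"),   # a two-state cycle
                ("NV", "CA"), ("MT", "NV")])  # one-way feeders

C = nx.condensation(G)
# Each condensation node stores its original nodes in the "members" attribute.
members = {n: set(data["members"]) for n, data in C.nodes(data=True)}
print(members)

# A condensation never has cycles: every cycle collapses into one node.
print(nx.is_directed_acyclic_graph(C))

# The only strongly connected component without outgoing edges.
attracting = [set(c) for c in nx.attracting_components(G)]
print(attracting)
```

Because the condensation is always acyclic, it can be topologically sorted even when the original graph cannot.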

Reversal and Flattening 

You cannot live your life backward or gather spilled milk, but you can reverse 
a directed graph with the method G.reverse(). The method returns a copy of 
the original graph with each edge reversed. The indegrees, outdegrees, 
successors, and predecessors in the new graph are the outdegrees, indegrees, 
predecessors, and successors of the original graph, respectively. Both graphs 
have the same weakly and strongly connected components. If your graph 
represents consequences for each cause, the reversed one shows all the 
causes for each consequence. 

Finally, NetworkX provides a tool for getting rid of directedness altogether. 
Method G.to_undirected(reciprocal=False) returns an undirected copy of a directed 
graph. If the parameter reciprocal is True, then the method connects two nodes 
with an undirected edge only if they are already connected by a pair of 
antiparallel directed edges. Otherwise, directed edges are demoted to 
undirected edges, and any possible resulting pairs of parallel edges are merged. 
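
The two operations can be sketched without NetworkX on a plain set of (source, target) pairs. The helpers reverse() and to_undirected() below are illustrative stand-ins that mimic the behavior described above; they are not the library's implementation.

```python
def reverse(edges):
    """Flip every directed edge."""
    return {(v, u) for u, v in edges}

def to_undirected(edges, reciprocal=False):
    """Demote directed edges to undirected ones (stored as frozensets).

    With reciprocal=True, keep only the pairs connected in both directions.
    Storing edges in a set merges antiparallel pairs automatically.
    """
    if reciprocal:
        edges = {(u, v) for u, v in edges if (v, u) in edges}
    return {frozenset((u, v)) for u, v in edges}

edges = {("Alice", "Bob"), ("Bob", "Alice"), ("Alice", "Chuck")}
print(reverse(edges))
print(to_undirected(edges))                   # two undirected edges
print(to_undirected(edges, reciprocal=True))  # only the mutual Alice-Bob edge
```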

Apply Topological Sort to Directed Acyclic Graphs 

A directed acyclic graph (DAG) is a special type of directed network. As the 
name suggests, it is an acyclic network: a network that does not contain any 
cycles (cycles are explained on page 89). Visually, a DAG often looks like a 
tree, a forest, a star, or a linear graph; for example, the linear graph and the 
tree in the figure on page 4 are DAGs. 

Directed acyclic graphs describe hierarchies: systems in which the 
components are ranked one above the other according to some property. (A hierarchy 
is often informally referred to as a “pecking order”: who pecks whom?) In a 
hierarchy, any two components A and B are either unrelated, or A is 
unambiguously subordinated to B, or B is unambiguously subordinated to A, either 
directly or indirectly. On the contrary, subordination is ambiguous in directed 
graphs with cycles. For example, in a two-node ring consisting only of A and 
B, both nodes can claim that they supervise the other node. 

Pecking Order 

The edges of a directed acyclic graph often represent a dominance hierarchy or 
subordination. The source node of an edge is the “boss,” and the target node is a 
“subordinate.” Incidentally, dominance in chickens is asserted by pecking. The “top” 
chicken pecks a more inferior chicken, which, in turn, pecks an even more inferior 
chicken, all the way down to the “bottom” chicken. Pecking order effectively executes 
a topological sort and leads to social stratification. 

All NetworkX functions and techniques for directed networks naturally work 
for DAGs, but several functions are intended solely for DAGs. Function 
nx.is_directed_acyclic_graph(G) checks if G is a DAG or not. Function 
nx.transitive_closure(G) calculates a transitive closure T of G: a graph that has 
the same nodes as G such that two nodes in T are adjacent if and only if there is 
a path between the two nodes in G. Think of a transitive closure as a graph of all 
possible subordination relationships, both direct and indirect. 
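
As a rough illustration of what a transitive closure contains, the sketch below computes it for a graph stored as a dictionary of successor sets. The functions and the four-level "company" are hypothetical and written in pure Python, so they are not the NetworkX implementation. The second helper uses the fact that a directed graph is acyclic exactly when no node can reach itself.

```python
def transitive_closure(succ):
    """Map each node to the set of all nodes reachable from it by a path."""
    closure = {}
    for start in succ:
        reached, stack = set(), list(succ[start])
        while stack:
            node = stack.pop()
            if node not in reached:
                reached.add(node)
                stack.extend(succ.get(node, ()))
        closure[start] = reached
    return closure

def is_acyclic(succ):
    """A directed graph is acyclic if no node can reach itself."""
    return all(node not in reached
               for node, reached in transitive_closure(succ).items())

# Made-up chain of command: an edge points from a boss to a direct subordinate.
bosses = {"CEO": {"VP"}, "VP": {"Manager"},
          "Manager": {"Intern"}, "Intern": set()}
closure = transitive_closure(bosses)
print(closure["CEO"])                             # direct and indirect subordinates
print(is_acyclic(bosses))
print(is_acyclic({"A": {"B"}, "B": {"A"}}))       # the two-node ring is not a DAG
```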

You can serialize a DAG and arrange all nodes in a linear order, so that the 
next node may be a subordinate of the previous node, but the previous node 
is never a subordinate of the next node. The result of the serialization is a 
ranking of all nodes, with the source nodes at the beginning and target nodes 
at the end. This operation is called topological sort. You can sort a DAG in 
many different ways, resulting in different “pecking” rankings. Function 
nx.topological_sort(G) returns one of the valid rankings as a list of node labels. 

nx.topological_sort(G) 

< ['NM', 'DC', 'AR', 'HI', 'MI', 'SC', 'PR', 'MS', 'TN', 'CT', 'AL', 'ME', 
   'OR', 'VT', 'UT', 'DE', 'NC', 'NH', 'WY', 'CO', 'OK', 'IN', 'AK', 'WA', 
   'SD', 'AZ', 'KS', 'MO', 'RI', 'MA', 'LA', 'TX', 'MD', 'NE', 'IA', 'NV', 
   'MT', 'ID', 'WV', 'VA', 'KY', 'OH', 'PA', 'NJ', 'ND', 'MN', 'WI', 'IL', 
   'CA', 'GA', 'FL', 'NY'] 

A topological sort order is not too useful because it focuses on what is 
impossible rather than on what is definite. You can tell from the order that 
New Mexico (NM) is not one of the top five destinations for the residents of 
New York (NY), but you cannot claim that New York is one of the top five 
destinations for the inhabitants of New Mexico. 
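
If you want to experiment with topological sorting without NetworkX, the Python standard library (3.9 and later) ships the module graphlib. The sketch below is an alternative to nx.topological_sort(), not the book's method, and the four-hen pecking order is invented. Note that TopologicalSorter expects, for each node, the set of nodes that must come before it.

```python
from graphlib import TopologicalSorter, CycleError

# An edge u -> v means "u outranks v" (made-up pecking order).
edges = [("hen1", "hen2"), ("hen1", "hen3"),
         ("hen2", "hen4"), ("hen3", "hen4")]

# Convert the edge list to {node: set of predecessors}.
predecessors = {}
for u, v in edges:
    predecessors.setdefault(u, set())
    predecessors.setdefault(v, set()).add(u)

# One valid ranking; hen2 and hen3 may appear in either order.
order = list(TopologicalSorter(predecessors).static_order())
print(order)

# A cycle makes topological sorting impossible.
try:
    list(TopologicalSorter({"A": {"B"}, "B": {"A"}}).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

The order tells you only what is impossible: hen4 never outranks hen1. It does not tell you whether hen2 outranks hen3.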


Master "toposort" 

This section uses Pandas, community, toposort. 

Directed network analysis has an unexpected connection to creative writing 
and computer game development. Game developers and creative writers are 
often in need of a collection of adjectives that characterize a particular 
property and range from “very bad” to “very good.” Directed network analysis 
(via the module toposort) makes it possible to design such a scale in any 
natural language. 


Obtain and Extract Survey Data 

You can start this mini case study by defining a list of candidate adjectives: 
in our case, the list consists of thirty-four words: “alpha plus,” “average,” 
“bad,” “crappy,” “disgusting,” “excellent,” “exciting,” “f*cking good,” “fantastic,” 
“filthy,” “first-class,” “good,” “great,” “horrible,” “lousy,” “magical,” “mediocre,” 
“pathetic,” “nice,” “none of a,” “normal,” “not bad,” “phenomenal,” “premium,” 
“repugnant,” “shitty,” “so-so,” “solid,” “strong,” “superb,” “superior,” “unfit,” 
“weak,” and “worthless.” 



"Babushka," "Sputnik," "Balalaika"... 

The original data for this case study was collected by me in 2016 
in the Russian language and later translated into English. You 
may argue that the perception and interpretation of qualitative 
adjectives in the two languages differ, and you are probably right. 
However, the goal of the project is to introduce and illustrate the 
technique, rather than produce an actual highly reliable line of 
adjectives. 

Post the list as a survey to Qualtrics,² SurveyMonkey,³ or your other favorite 
survey-taking site. You have to design the questionnaire in such a way that 
the takers either rank all words in the order from the “best” to the “worst” or 
assign a numerical measure of “goodness” to each word. The survey design 
is outside the scope of this book, but keep in mind that asking survey takers 
to arrange thirty-four words on a cellphone screen may be more than an 
average person is ready to commit to. 

Once you collect enough samples, download the results as a CSV file (say, 
Adjectives_by_the_rank.csv). Depending on the surveying site, the file may need a 
lot of cleanup before becoming useful. The following code fragment imports 
a CSV file produced by Qualtrics, extracts the thirty-four columns that 
correspond to the word ranks, and removes the survey question from the column 
names. 

adjectives.py 

import pandas as pd 

# Skip the first header row; missing ranks become 0 
ranks = pd.read_csv("Adjectives_by_the_rank.csv", 
                    header=1).set_index("ResponseID").fillna(0) 

# Keep only the rank columns and strip the question from their names 
Q1 = "Rank the words from the most positive to the most negative-" 
ranks = ranks.loc[:, ranks.columns.str.startswith(Q1)].astype(int) 
ranks.columns = ranks.columns.str.replace(Q1, "") 

Let’s now build a network of words. Each column of the DataFrame ranks 
represents the ranks of a word from each participant. Connect the word i to 
another word j with a directed edge if the participants agree, to some extent, 
that i is “better” than j. The definition of what constitutes the agreement may 
be stringent (by consensus), weak (when at least two participants agree), or 
somewhere in the middle (say, at least 115 of 158 participants agree). The 
consensus-based network would have very few, if any, edges. The network 
based on the weak criterion may have too many edges and contain cycles. 
We want to construct a network that is dense but still has no cycles, because 
if it is not a DAG, then it cannot be topologically sorted. 

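adjectives.py 

import networkx as nx 

# dominance[i][j]: the number of participants for whom ranks[j] > ranks[i] 
dominance = pd.DataFrame([[(ranks[j] > ranks[i]).sum() 
                           for i in ranks] for j in ranks], 
                         columns=ranks.columns, index=ranks.columns) 

QUORUM = 115 
# Keep only the word pairs supported by at least QUORUM participants 
edges = sorted(dominance[dominance >= QUORUM].stack().index.tolist()) 
G = nx.DiGraph(edges) 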
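
The same counting can be sketched in pure Python on a tiny, invented survey. The three respondents, the four words, and the unanimous quorum are assumptions for illustration only. As in the comparison ranks[j] > ranks[i] of the listing, a pair (a, b) is kept when enough respondents give a a numerically larger (less positive) rank than b.

```python
# Hypothetical survey: each respondent assigns a rank to every word
# (1 = most positive). The numbers are invented for illustration.
ranks = [
    {"great": 1, "good": 2, "so-so": 3, "bad": 4},
    {"great": 1, "good": 3, "so-so": 2, "bad": 4},
    {"great": 2, "good": 1, "so-so": 3, "bad": 4},
]
QUORUM = 3  # here: all three respondents must agree

words = sorted(ranks[0])
# dominance[(a, b)] counts the respondents who rank a below b.
dominance = {(a, b): sum(r[a] > r[b] for r in ranks)
             for a in words for b in words if a != b}

# Only the sufficiently supported pairs become directed edges.
edges = sorted(pair for pair, votes in dominance.items() if votes >= QUORUM)
print(edges)
```

The respondents disagree about "good" versus "great" and about "good" versus "so-so", so those pairs produce no edges in either direction. Lowering the quorum to 2 would add them, at the risk of creating cycles in a larger survey.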


2. www.qualtrics.com 

3. www.surveymonkey.com 

The resulting network G has thirty-four nodes (one node per qualitative word) 
and 497 edges. It consists of one dense weakly connected component shown 
in the following figure. 



Amazingly, the graph is a DAG. 
nx.is_directed_acyclic_graph(G) 

< True 

Unfortunately, it looks incomprehensible. 

Execute Topological Sort 

You can try to bring some order by topologically sorting the network (Apply 
Topological Sort to Directed Acyclic Graphs, on page 203): 

adjectives.py 

# Sort in the reverse order 
print(nx.topological_sort(G)[::-1]) 

< ['exciting', 'fantastic', 'phenomenal', 'superior', 'first-class', 'magical', 
   'f*cking good', 'superb', 'premium', 'alpha plus', 'great', 'excellent', 
   'strong', 'good', 'solid', 'not bad', 'nice', 'normal', 'average', 
   'mediocre', 'none of a', 'so-so', 'weak', 'bad', 'unfit', 'worthless', 
   'pathetic', 'lousy', 'shitty', 'horrible', 'filthy', 'repugnant', 
   'disgusting', 'crappy'] 

The output of the function is unbelievably realistic. “Fantastic” is undeniably 
better than “great,” which is better than “lousy,” which is better than 
“disgusting.” The only problem with the function nx.topological_sort(G) is that it 
returns only one possible topological sort order. It forcefully ranks the nodes 
that are otherwise topologically equivalent, adding unnecessary constraints to 
the way the node labels can be used. There is no way to obtain another order with 
nx.topological_sort(). 

The module toposort⁴ provides the function toposort.toposort(edge_dict) that does 
not have the limitations of the function nx.topological_sort(). This function returns 
a generator of sets of topologically equivalent nodes. A node in a set does not 
dominate and is not dominated by any node in the same set. Game developers 
and creative writers, the prospective users of the word sets, would treat all 
words in one set as having the same sentiment (but not necessarily the same 
valence). 

The function toposort.toposort(edge_dict) is not integrated with NetworkX. Before 
using the function, transform a NetworkX edge list G.edges() into a dictionary 
where nodes are keys, and sets of their neighbors are values. 

adjectives.py 

edge_dict = {n1: set(ns) for n1, ns in nx.to_dict_of_lists(G).items()} 
topo_order = list(toposort.toposort(edge_dict)) 
print(topo_order) 

< [{'phenomenal', 'exciting', 'fantastic'}, {'f*cking good', 'magical', 
   'first-class', 'great', 'superb', 'superior'}, {'alpha plus', 'excellent', 
   'premium'}, {'solid', 'strong', 'good'}, {'normal', 'nice', 'not bad'}, 
   {'average'}, {'none of a', 'mediocre', 'so-so'}, {'weak'}, {'worthless', 
   'unfit', 'pathetic', 'bad'}, {'lousy'}, {'shitty', 'filthy', 'horrible', 
   'crappy'}, {'repugnant', 'disgusting'}] 

The new output is “phenomenal,” “exciting,” and “fantastic.” It consists of 
twelve equivalence classes of words, each class being “worse” than the 
predecessor and “better” than the successor. Despite being based only on 158 
responses, the result does not look unexpected. The toposort algorithm is 
even “smart” enough to put the unappetizing adjectives in the second set from 
the end together. 
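
To see where such equivalence classes come from, here is a hedged sketch that produces the same kind of layered output with the standard library module graphlib (Python 3.9 and later) instead of toposort. The helper layers() and the five-word dictionary are assumptions for illustration. Each layer consists of the nodes whose dependencies all belong to earlier layers.

```python
from graphlib import TopologicalSorter

def layers(dependencies):
    """Yield sets of nodes; each set depends only on nodes in earlier sets."""
    ts = TopologicalSorter(dependencies)
    ts.prepare()                 # raises graphlib.CycleError if there is a cycle
    while ts.is_active():
        ready = ts.get_ready()   # all nodes whose dependencies are satisfied
        yield set(ready)
        ts.done(*ready)

# node -> set of nodes that must precede it (made-up example)
deps = {"good": {"great", "superb"}, "bad": {"good"}, "lousy": {"good"},
        "great": set(), "superb": set()}
print(list(layers(deps)))
```

No word in a layer dominates another word of the same layer, which is why the words of one layer can be treated as interchangeable.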

In this chapter, you learned how to identify, capture, and explore (with 
topological sort) any asymmetric relationships between network nodes. Incidentally, 
this chapter concludes the main body of the book. Whether or not you were 
a seasoned complex network analyst and Python programmer at the beginning 
of the book, now you are. You will never see the world around you the same 
way, because when all you know is CNA, everything looks like a network. 

4. pypi.python.org/pypi/toposort 

In the Appendix 

Just like almost everything, NetworkX evolves. The second major version of the 
library was released while this manuscript was in preparation. You will 
read about the new NetworkX 2.0 in Appendix 2, NetworkX 2.0, on page 213. 

APPENDIX 1 


Network Construction, Five Ways 

This appendix compares the ways of constructing the Lincoln family tree 
network on page 4 (shown in the following figure) in pure Python and using 
the four toolkits from Chapter 2, Surveying the Tools of the Craft, on page 11. 



Pure Python 

The most natural way to describe a network in pure Python is to represent 
each edge as a tuple or list of two nodes and collect all edge tuples or lists in 
another list: the edge list. 

make-network-type-figures.py 
lincoln_list = [ 
    ("A.L.", "Edward Baker L."), ("A.L.", "Robert Todd L."), 
    ("A.L.", "William Wallace L."), ("A.L.", "Thomas L. III"), 
    ("Jessie Harlan L.", "Mary L. Beckwith"), 
    ("Jessie Harlan L.", "Robert Todd L. Beckwith"), 
    ("Mary L.", "L. Isham"), ("Robert Todd L.", "A.L. II"), 
    ("Robert Todd L.", "Jessie Harlan L."), 
    ("Robert Todd L.", "Mary L."), ("Thomas L.", "A.L."), 
    ("Thomas L.", "Sarah L. Grigsby"), ("Thomas L.", "Thomas L. Jr."), 
] 

The previous example has at least three major issues: 

• Isolated nodes (nodes without edges) cannot be represented because they are 
not incident to any edges (correctness issue). 

• Lists have linear search time (performance issue). 

• Node labels are replicated for each incident edge (memory footprint issue). 

You could mitigate the first two issues by representing the network as a dictionary, 
where the keys are node labels, and the values are sets of the adjacent nodes: 

lincoln_dict = { 
    "A.L.": {"Thomas L. III", "Edward Baker L.", "William Wallace L.", 
             "Robert Todd L."}, 
    "Jessie Harlan L.": {"Robert Todd L. Beckwith", "Mary L. Beckwith"}, 
    "Mary L.": {"L. Isham"}, 
    "Robert Todd L.": {"Mary L.", "A.L. II", "Jessie Harlan L."}, 
    "Thomas L.": {"Sarah L. Grigsby", "A.L.", "Thomas L. Jr."}, 
    "George W.": set(), 
} 

Note how we incorporated George W[ashington] into the network, despite his not 
having incident edges yet. (This network representation is used in Apply 
Topological Sort to Directed Acyclic Graphs, on page 203.) The new approach, 
however, has its own problems: 

• Some nodes that have only incoming links (such as “Thomas L. III”) are now 
on the second level of the hierarchy and hard to find. 

• You cannot have parallel edges that connect the same two nodes more than 
once. 

• Node labels are still replicated for each incoming incident edge. 
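
One way to soften the first of these problems is to derive the dictionary from the edge list and to give every node, including the nodes with only incoming links, its own top-level key. The function to_adjacency() and the three-edge sample below are a sketch under that assumption, not code from make-network-type-figures.py.

```python
def to_adjacency(edge_list, isolated=()):
    """Convert an edge list to {node: set of successors}, keeping isolated nodes."""
    adjacency = {node: set() for node in isolated}
    for source, target in edge_list:
        adjacency.setdefault(source, set()).add(target)
        # Nodes with only incoming edges become keys, too.
        adjacency.setdefault(target, set())
    return adjacency

sample = [("Thomas L.", "A.L."), ("A.L.", "Robert Todd L."),
          ("A.L.", "Thomas L. III")]
family = to_adjacency(sample, isolated=["George W."])
print(family)
```

The price is a few extra empty sets, but a membership test such as "Thomas L. III" in family now takes constant time.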

iGraph 

Here’s how iGraph deals with the Lincoln graph. The edge list contains numerical 
node identifiers rather than labels, but you can add the labels later as node 
attributes. 

import igraph 

edges = [(1, 6), (1, 7), (1, 5), (1, 12), (2, 4), (2, 9), (3, 13), (7, 0), 
         (7, 2), (7, 3), (8, 1), (8, 10), (8, 11)] 
labels = [ 
    "A.L. II", "A.L.", "Thomas L.", "Jessie Harlan L.", "Mary L. Beckwith", 
    "Sarah L. Grigsby", "Edward Baker L.", "Mary L.", "William Wallace L.", 
    "Robert Todd L.", "Robert Todd L. Beckwith", "Thomas L. Jr.", 
    "Thomas L. III", "L. Isham", "George W."] 

G = igraph.Graph(edges, directed=True) 
G.add_vertex(14) 
G.vs["name"] = labels 
print(G) 

< IGRAPH DN-- 15 13 -- 
+ attr: name (v) 

+ edges (vertex names): 

A.L.->Edward Baker L., A.L.->Mary L., A.L.->Sarah 
L. Grigsby, A.L.->Thomas L. III, Thomas L.->Mary L. Beckwith, 

Thomas L.->Robert Todd L., Jessie Harlan L.->L. Isham, Mary 
L.->A.L. II, Mary L.->Thomas L., Mary L.->Jessie 

Harlan L., William Wallace L.->A.L., William Wallace L.->Robert Todd L. 
Beckwith, William Wallace L.->Thomas L. Jr. 

Alas, the unconnected nodes are not shown on the printout! Poor George W. 

graph-tool 

The graph-tool version of the Lincoln graph starts with the same edge list and 
a list of labels. All vertices are added at once, followed by all edges. graph-tool 
treats labels as vertex properties. 

import graph_tool 

edges = [(1, 6), (1, 7), (1, 5), (1, 12), (2, 4), (2, 9), (3, 13), (7, 0), 
         (7, 2), (7, 3), (8, 1), (8, 10), (8, 11)] 
labels = [ 
    "A.L. II", "A.L.", "Thomas L.", "Jessie Harlan L.", "Mary L. Beckwith", 
    "Sarah L. Grigsby", "Edward Baker L.", "Mary L.", "William Wallace L.", 
    "Robert Todd L.", "Robert Todd L. Beckwith", "Thomas L. Jr.", 
    "Thomas L. III", "L. Isham", "George W."] 

G = graph_tool.Graph() # Directed by default 
nodes = G.add_vertex(len(labels)) 
G.add_edge_list(edges) 

# Add node labels ("vertex properties") 
names = G.new_vertex_property("string") 
for node, label in zip(nodes, labels): 
    names[node] = label 

NetworkX 

The striking difference between the NetworkX code and the other previously 
shown code fragments is the ability to add named edges and nodes directly 
to the graph, which is a natural way of building a real-world complex network. 

make-network-type-figures.py 
lincoln_list = [ 
    ("A.L.", "Edward Baker L."), ("A.L.", "Robert Todd L."), 
    ("A.L.", "William Wallace L."), ("A.L.", "Thomas L. III"), 
    ("Jessie Harlan L.", "Mary L. Beckwith"), 
    ("Jessie Harlan L.", "Robert Todd L. Beckwith"), 
    ("Mary L.", "L. Isham"), ("Robert Todd L.", "A.L. II"), 
    ("Robert Todd L.", "Jessie Harlan L."), 
    ("Robert Todd L.", "Mary L."), ("Thomas L.", "A.L."), 
    ("Thomas L.", "Sarah L. Grigsby"), ("Thomas L.", "Thomas L. Jr."), 
] 

import networkx as nx 

G = nx.DiGraph(lincoln_list) # Directed! 
G.add_node("George W.") 

NetworKit 

You can convert a previously constructed NetworkX graph G into a NetworKit graph 
nkG with a call to one function. 

import networkit.nxadapter 
G = ... # Construct a NetworkX graph 
nkG = networkit.nxadapter.nx2nk(G) 

We cannot be certain of being right about the future; but we can be almost 
certain of being wrong about the future, if we are wrong about the past. 
Gilbert Keith Chesterton, English writer, poet, philosopher, 
dramatist, journalist, orator, lay theologian, biographer, and literary 
and art critic 


APPENDIX 2 


NetworkX 2.0 



A new version of NetworkX, 2.0, was released in September 2017. The new 
version is only partially compatible with the stable version used in the book. 
The book describes over one hundred NetworkX functions. Some of them have 
been affected by the transition. 

Surely, the new version does not instantly make version 1.11 obsolete. 2.0 
will be gradually phased in, while 1.11 will be phased out. In this appendix, 
you will read about the most striking differences between the old and the new 
versions, which will help you to adjust your CNA scripts to work with the 
most recent version of NetworkX. A complete migration guide is available online.¹ 

From Containers to Views 

• Function nx.subgraph() and methods G.subgraph(), G.neighbors(), G.reverse(), 
G.to_directed(), and G.to_undirected(), to name a few, return View objects 
instead of graphs. A view refers to the underlying graph. Any node, 
edge, or attribute change in the underlying graph affects all of the 
associated views. 

• New attributes G.nodes and G.edges contain dict()-like NodeView and EdgeView 
objects, respectively. Methods G.nodes() and G.edges() return the 
namesake views, too. 

• New attributes G.degree, G.in_degree, and G.out_degree contain dict()-like 
DegreeView objects. 
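
The practical consequences can be sketched as follows, assuming that NetworkX 2.0 or later is installed. The three-node graph is made up for illustration, and the exact behavior of individual methods is best checked against the migration guide.

```python
import networkx as nx

G = nx.DiGraph([("A", "B"), ("B", "C")])

# Degree views are not dictionaries; wrap them in dict() when
# dictionary methods such as .items() are needed.
in_degrees = dict(G.in_degree())
print(in_degrees)

# A subgraph view stays connected to the underlying graph...
view = G.subgraph(["A", "C"])
G.add_edge("A", "C")
print(view.has_edge("A", "C"))      # the view sees the new edge

# ...whereas an explicit copy is a snapshot.
snapshot = G.subgraph(["A", "B"]).copy()
G.remove_edge("A", "B")
print(snapshot.has_edge("A", "B"))  # the copy is unaffected
```

When you port a script from version 1.11, adding .copy() to a subgraph (or dict() to a degree call) is usually the smallest change that restores the old behavior.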

Usage Change 

• Function bipartite.sets() raises an AmbiguousSolution exception if the input 
bipartite graph is disconnected, because it is not possible to separate 
bipartite sets in a disconnected graph unambiguously. 


1. networkx.github.io/documentation/stable/release/migration_guide_from_1.x_to_2.0.html 
• The order of parameters changed for the functions nx.set_edge_attributes() 
and nx.set_node_attributes(). The list of attribute values is now the second 
parameter, and the attribute name is the third parameter. 

• Method G.selfloop() became function nx.selfloop(). 
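
Here is a short sketch of the new parameter order, assuming NetworkX 2.0 or later and a made-up one-edge graph. Passing the arguments by keyword, as in the last call, is one way to keep a script readable regardless of the positional order.

```python
import networkx as nx

G = nx.Graph([("A", "B")])

# NetworkX 2.x positional order: graph, values, attribute name.
nx.set_node_attributes(G, {"A": "red", "B": "blue"}, "color")
nx.set_edge_attributes(G, {("A", "B"): 0.5}, "weight")

# Keyword arguments make the intent explicit.
nx.set_node_attributes(G, values={"A": 1, "B": 2}, name="rank")

print(G.nodes["A"], G["A"]["B"])
```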

Deprecation 

• Function nx.blockmodel() deprecated in favor of nx.quotient_graph() with 
relabel=True. 

• Functions nx.from_pandas_dataframe() and nx.to_pandas_dataframe() deprecated 
in favor of nx.to_pandas_adjacency(), nx.from_pandas_adjacency(), 
nx.to_pandas_edgelist(), and nx.from_pandas_edgelist(). 
SYMBOLS 

[] (selection operator), 23 

A 

A4 page format, 40 
Abraham Lincoln timeline 
  converting adjacency matrix, 70-75 
  with graph-tool, 211 
  with iGraph, 210 
  moving data with edge lists and node dictionaries, 76-77 
  with NetworKit, 212 
  with networkx, 212 
  with pure Python, 209 
  slicing, 80 
  visualizations, 4 
actors, see nodes 
acyclic graphs, directed, 203-208 
Adamic, Lada, 55 
add_edge(), 19, 23 
add_edges_from(), 19, 23, 71 
add_node(), 19, 23 
add_nodes_from(), 19, 23 
add_weighted_edges_from(), 24 
adjacency 
  bipartite networks, 180-181, 187 
  clique communities, 134 
  creating networks from adjacency matrices, 69-75 
  defined, 17 
  importing and exporting adjacency lists, 30 
adjectives for game developers example, 204-208 
algorithm.bipartite module, 177 
Alice, 62 
all_neighbors(), 85 
alter nodes, 54-57, 84-86 
Amazon’s Mechanical Turk, 139 
Anaconda, xv, 136 
angular distance and cosine distance, 172 
Animal Social Networks, 53 
anti-communities, 136 
anti-parallel directed edges, 197 
assortative mixing, 97, 105 
assortativity 
  coefficient, 100, 105 
  cosmetics case study, 156 
  defined, 97 
  estimating uniformity, 97-100 
  Panama Papers case study, 105-107 
  in social network analysis, 6 
asymmetry 
  directed graphs, 197-199 
  examples, 197 
  family trees, 3 
  signed networks, 199 
  timelines, 3 
attracting components, 202 
attracting_components(), 202 
attracting_components_subgraphs(), 202 
attribute_assortativity_coefficient(), 100, 105 
attribute_mixing_matrix(), 99, 105 
attributes, see also similarity 
  Abraham Lincoln timeline, 210 
  adding, 23-25 
  assortativity, 97-100, 105-107 
  binarizing, 166 
  changing, 24 
  co-occurrence, 115 
  contractions, 45 
  defined, 2 
  editing appearance, 31 
  in graph-tool, 15 
  Hamming distance, 168 
  handling with Pandas, 74 
  Manhattan distance, 169 
  networkx 2.0, 213 
  normalizing for Manhattan distance, 170 
  processing selectively, 109 
  removing, 24 
  selecting incident edges, 23 
  storage in networkx, 20 
authorities, 35, 95-96 
average_clustering(), 88 
average_degree_connectivity(), 98 

B 

balance theory, 57 
balanced_tree(), 78 
barabasi_albert_graph(), 78 
Barabasi, Albert-Laszlo, 129 
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Barabasl-Albert graphs 
about, 65 

generatlng synthetic net- 
works, 78 

Panama Papers case 
study, 106 
Bames, John, 6 
BeautifulSoup, 154 
best-partition(), 191 
betweenness centrallty 

bipartite networks, 178 
defined, 93 
directed graphs, 201 
ego networks, 57 
measurlng, 35, 96, 201 
Othello semantic network, 
119 

product networks, 122 
soclal networks, 57 
betweenness centralityO, 96, 178 
BFS (breadth-first search), 43 
bi-adjacency matrix, 180, 187 
bmomlal graphs, 

see Erdos-Renyi graphs 
biological networks, examples, 5
bipartite networks, 175-183
anti-communities, 136
checking for, 177, 187
defined, 175
disadvantages, 176
examples, 176
functions, 177-178
networkx 2.0, 213
projecting, 178-183, 190
sketching by hand example, 7
trauma types case study, 185-192
bipartite.sets(), 177, 213
blockmodel(), 138
blockmodeling
core-peripheral analysis, 129
cosmetics case study, 157
defined, 138
deprecation in networkx 2.0, 214
with graph-tool, 14
naming extracted blocks, 139
performing, 138
term, 138
Bob, 62
bottom nodes, bipartite networks, 178-183
branching factor, 78
Brandes, Ulrik, 95
breadth-first search (BFS), 43
bridges, 66, 94
broadcasting and communication networks, 62
brokerage, 57, 93
Building Mini-Categories in Product Networks, 122

C

C
graph-tool, 13, 16
iGraph support, 12, 16
limitations for CNA, xiii
C++, 13, 16
caching, data for cultural domain analysis, 143
case, lowering, 145
case studies, see also Wikipedia pages case study
cosmetic products, 153-160
cultural domain analysis, 141-151
Panama Papers, 101-111, 176
trauma types network, 185-192
CDA, see cultural domain analysis

center, measuring, 91, 201
center() method, 91
centrality, see also betweenness centrality; eccentricity; eigenvector centrality
closeness centrality, 35, 93, 96, 178, 201
degree centrality, 35, 92, 201
directed graphs, 201
harmonic centrality, 93, 96
HITS (Hyperlink-Induced Topic Search), 35, 95-96
measuring, 32, 35, 90, 92-97, 201
measuring correlation of, 96
Othello semantic network, 119
PageRank, 35, 94-96
social network examples, 57
chain.from_iterable(), 155
chat sessions, recording, 63
checkerboard networks, see grids
Chuck, 62
circles, networks as, 91
circular layout, 26-28
circular_layout(), 26
cityblock(), 170

classic networks, see also simple networks
generating synthetic networks, 78
types, 2-4, 63
clear(), 20
clique percolation, 134
clique-node, 133
cliques
bipartite networks, 178
clique-node, 133
defined, 131
extracting, 131-134
maximal, 132-133, 138
maximum, 132
percolation, 134
recognizing clique communities, 134-135
cliques, social, 66
closeness centrality
bipartite networks, 178
defined, 93
directed graphs, 201
measuring, 35, 96, 201
closeness_centrality(), 96, 178
clustering, see clustering coefficients; community detection; modularity
clustering coefficients
bipartite networks, 178
clustering as term, 88
clusters as term, 134
measuring, 32, 35, 87
clustering(), 87, 178
CNA, see complex network analysis
co-occurrence
component analysis, 128
cultural domain analysis case study, 147
defined, 115
defining for analysis, 119
product networks, 120-123
semantic networks, 116-120
code for this book
cosmetics case study data, 154
cultural domain analysis case study data, 141
dzcnapy_plotlib module, 27
migration distribution data, 199
modules needed, xv
notation, xvi
online files, xvii
trauma types case study data, 186
cognitive balance, see homophily
coincidence, see co-occurrence
collections, cosmetics case study, 153-160
communication networks, understanding, 61
communities, see also cliques
anti-communities, 136
blockmodeling, 138
community detection, 12, 14, 16
cosmetics case study, 157
cultural domain analysis case study, 148-150
defined, 35
modularity-based, 136-138, 191
partitioning into, 35, 88, 148-150, 157, 191
social network examples, 57
term, 134
trauma types case study, 191
community
cultural domain analysis case study, 148-150
managing modularity-based communities, 136
trauma types case study, 191
version, xv
community detection
graph-tool, 14, 16
iGraph, 12, 16
networkx, 12, 16


complements in product networks, defined, 120
complete graphs
about, 64
clustering coefficient, 87
generating, 78
complete subgraphs, see cliques
complete_graph(), 78
complex contagion, 57
complex network analysis, see also case studies; networks
bipartite networks, 178
cautions about building own modules, 11
defined, xiii, 1
with Gephi, 32, 34-37
history, 1, 5
as iterative process, 17
use in medical and public health, 185
complex networks, see also cliques; networks; similarity
defined, 4
major classes, 5
separating cores, shells, coronas, and crusts, 129-131
sketching by hand, 6-8
splitting into connected components, 126-128
structural elements, 125-139
synthetic networks, 78
components
attracting components, 202
condensation, 202
cosmetics case study, 155-160
detection, 90
directed acyclic graphs, 203
directed graphs, 201
filtering, 104
giant connected component (GCC), 125, 128-129, 155
measuring connectedness with Gephi, 32, 35
naming, 157-160
Panama Papers case study, 104
reversal, 202
splitting networks into, 126-128
components, connected
compared to cliques, 131
cosmetics case study, 155
defined, 126
filtering, 104
separating cores, shells, coronas, and crusts, 129-131
splitting networks into, 126-128

Conceptual Structure of Fraud Research and Its Dynamics, 116
condensation, 202
condensation() method, 202
connected_component_subgraphs(), 104, 128
connected_components(), 127, 155
connected_watts_strogatz_graph(), 78
connectedness, see also assortativity
measuring, 32, 35
Panama Papers case study, 104
splitting networks into connected components, 126-128
consequences, reversal, 202
contagion, xv, 57
Continuum Analytics, xv
contracted_nodes(), 45
contractions, 45
core
defined, 129-130
main, 130
separating, 129-130
core nodes, collecting, 47
core-peripheral analysis, 129
corona
defined, 130
separating, 129-130
corr(), 96, 174
correlation, see Pearson correlation
cosine similarity, 171-173, 187-192
cosine(), 172
cosmetic products case study, 153-160
Counter(), 158
coupling, product networks, 122
crust
defined, 130
separating, 129-130
CSV files
adjectives for game developers example, 205
cosmetics case study, 153-160
dictionary reader, 103
food and nutrient example, 21
Panama Papers case study conversion, 101-111
trauma types case study, 186
cultural domain analysis
case study, 141-151
defined, 141
cultural networks, examples, 5
cycle networks, see rings
cycle_graph(), 78
cycles, trails as, 89

D 

DAGs, see directed acyclic graphs
data alignment, 146, 165
Data Laboratory tab in Gephi, 32-33
Data Science Essentials in Python, 73
DataFrame
converting adjacency matrices, 73-75
cultural domain analysis case study, 144, 146, 150
data alignment, 146, 165
defined, 73
importing node attributes, 74
naming communities helper, 150
networkx 2.0, 214
parsing Panama Papers CSV file, 108-111
parsing irregular texts, 145
Pearson correlation, 174, 180
similarity, 164
as term vector model, 146
Davis Southern women synthetic network
about, 65
bipartite networks, 176
generating, 79, 164
simplicity in, 164-166
Davis, Allison, 164
davis_southern_women_graph(), 79, 164

degree, see also degree centrality; indegree; outdegree
assortativity, 98, 106
bipartite networks, 176, 178
defined, 35
directed graphs, 199
giant connected component (GCC), 129
induced nodes, 159
networkx 2.0, 213
Panama Papers case study, 105-107
in social network analysis, 6
social network examples, 57
Wikipedia pages case study, 46
degree assortativity coefficient, 106
degree attribute, 213
degree centrality
bipartite networks, 178
defined, 92
measuring, 35, 92, 96, 201
degree() method, 47, 199
degree(induced) method, 159
degree_assortativity_coefficient(), 105
degree_centrality(), 92, 96, 178, 201
degrees(), 178
DegreeView, 213
deleting
all nodes and edges while keeping shell, 20
attributes, 24
duplicates, 45, 110
isolates, 126
nodes and edges with Gephi, 31
nodes and edges with networkx, 19, 45, 110
nodes from ego networks, 55
self-loops, 22, 45
density
bipartite networks, 178, 189
converting sparse matrix to dense, 76
defined, 84
measuring, 35, 84
density() method, 84, 178
describe_cluster(), 150
deserialization, with pickle, 143, see also serialization
diameter
defined, 91
directed graphs, 201
measuring, 35, 91, 201
diameter() method, 91
dichotomization, 166
dictionaries
attributes, 23
converting weight to, 147
CSV dictionary reader, 103
of dictionaries, 77
of lists, 77
measuring length, 21
moving data with node, 76-77
node and edge storage, 20
relabeling nodes, 22
to mitigate issues with edge lists, 210
digital humanities, 118
DiGraph(), 18

digraphs, see directed graphs
directed acyclic graphs, 203-208
directed edges
about, 18
anti-parallel, 197
asymmetric relationships, 197-199
defined, 197
networkx visualization, xvi
social networks, 53
directed graphs, 197-208
asymmetric relationships, 197-199
clique strength, 131
clustering coefficient, 87
condensation, 202
connected components, 126
converting to undirected, 18, 128, 202
creating, 18
defined, 18, 197
degree, 199
density, 84
directed acyclic graphs, 203-208
directed multigraphs, 19
eccentricity, 91, 201
flattening, 202
measurement, 199-208
networkx 2.0, 213
PageRank, 95
reversal, 202, 213
vs. signed networks, 199
social networks, 53
visualizations with Gephi, 199
weight, 199
Wikipedia pages case study, 43
directed multigraphs, 19
directed network analysis, game developer example, 204-208
discreteness, 2
dissortative networks, 97
distance, see also similarity
angular distance, 172
bipartite networks, 180-181
cosine distance, 171-173, 187-192
directed graphs, 201
Euclidean distance, 171
Hamming distance, 167-169, 180, 187-192
Manhattan distance, 169-171
understanding, 163, 167-173
documentation and communication networks, 63
draw_circular(), 26
draw_networkx(), 26-28, 158
draw_random(), 26
draw_shell(), 26
draw_spectral(), 26
draw_spring(), 26
dump(), 143
Dunbar’s number, 57
duplicate nodes
deleting, 45, 110
detecting, 33
merging, 33, 108
dyads
cliques, 132
defined, 6
neighborhoods as, 86
dynamics, xv, 57
dzcnapy_plotlib module, 27

E

eccentricity, 35, 91, 201
eccentricity() method, 91
ecological networks, examples, 5
economic networks, examples, 5
edge
accessing directly, 77
defining or changing attributes, 24
storing nodes with, 20
edge lists
food and nutrient example, 22
support for, 30
using, 20, 76, 209
edges, see also adjacency; attributes; centrality; communities; directed edges; edge lists; incident edges; labels; preferential attachment; weight
adding duplicate, 19
adding or removing with Gephi, 31, 33
adding or removing with graph-tool, 14
adding or removing with iGraph, 13
adding or removing with networkx, 19, 23, 46, 103
assortativity, 97-100
Barabasi-Albert graphs, 65, 79
in classic networks, 2-4
co-occurrence, 115, 119
communication networks, 61
converting similarities to, 163
core-peripheral analysis, 130
defined, 2
density, 84
directed graphs, 18, 199
distinguishing strong and weak in social networks, 66


Erdos-Renyi graphs, 64, 79
event networks, 164
gathering for Wikipedia pages case study, 41-44
Holme-Kim graphs, 65, 79
incidence matrices, 76
induced, 137
isolates, 125
measuring network size, 83
merging duplicates, 45
negatively weighted, 116, 120, 126
networkx 2.0, 213
non-existent, 84, 134
parallel edges and multigraphs, 18-19
path length, 89
product networks, 120-123
reversing, 202
selecting, 23
self-loops, 18, 33, 45, 89
semantic networks, 116-117, 119
signed, 199
social networks, 53, 57-60, 66
storing in networkx, 20
synthetic networks, 63-66, 78
trails, 89
truncating networks, 46
undirected graphs, 18
walks, 89
Watts-Strogatz graphs, 64, 79
edges attribute, 213
edges() method, 20, 77, 84, 213
EdgeView, 213
ego networks
clustering coefficient, 87
defined, 54
Facebook example, 55-57
neighborhood, 84-86
Panama Papers case study, 109
understanding, 54-57
ego nodes, 54-57, 84-87, 89
ego_graph(), 86
egocentric networks, see ego networks
eigenvector centrality
defined, 94
directed graphs, 201
measuring, 35, 96, 201
Othello semantic network, 119
product networks, 122
social networks, 57
spectral layout, 26
eigenvector_centrality(), 96
email as communication network, 62
empiric networks, 61
empty graphs, 18
Enron, 62
entities, see nodes
erdos_renyi_graph(), 78
Erdos-Renyi graphs
about, 64
generating synthetic networks, 78
sliced example, 80
ERGMs (exponential random graph models), xv, 6
Euclidean distance, 171
event networks, example of similarity, 164-166
evolution, xv, 6
Exploratory Social Network Analysis with Pajek, 58
exponential random graph models (ERGMs), xv, 6
exporting, networks, 30, 32, 39

F 

Facebook
ego network, 55-57
empiric networks, 61
median number of friends, 55
size of social network, 57
as social networking website, 54
family networks, see also Abraham Lincoln timeline
directed graphs, 18
Florentine families synthetic network, 65, 79
family trees, defined, 3
Federal Energy Regulatory Commission, 62
files, see also CSV files
formats for saving visualizations, 32, 39
importing and exporting networks, 30
filtering
components, 104
with Gephi, 32
with graph-tool, 14
.find(), 13
find_cliques(), 132, 134
flattening, 202
Florentine families synthetic network, 65, 79
florentine_families_graph(), 79
flows, xv, 198

food and nutrient examples
adding and removing nodes and edges, 19-23
adding attributes, 23-25
bipartite networks, 175-183
building programmatically, 19-30
with Gephi, 32-40
reading CSV file, 21
saving network, 29-30
sketching by hand, 6-8
visualization with Gephi, 37-40
visualization with matplotlib, 25-28
food fraud semantic network example, 116-118
food pantry examples
binarizing attributes, 166
as product network, 120
distance, 167-171
similarity, 166-171
force-directed layout, see Fruchterman-Reingold layout
ForceAtlas2 layout, 37
friendship paradox, 57
from_dict_of_lists(), 77
from_edgelist(), 76
from_numpy_matrix(), 72-73
from_pandas_adjacency(), 214
from_pandas_dataframe(), 73, 180, 214
from_pandas_edgelist(), 214
frozenset(), 135
Fruchterman-Reingold layout, 26-28, 37
fruchterman_reingold_layout(), 26

G 

game developer language toposort example, 204-208
GCC (giant connected component), 125, 128-129, 155
generalized module, xv, 182
generalized similarity, 174, 181, 187-192

generalized_similarity module, 182
geodesics
betweenness centrality, 93
closeness centrality, 93
defined, 90
eccentricity, 91
Gephi
about, xiv, 31
arrows in, xvi
capabilities, 31
creating networks, 31
cultural domain analysis, 148
directed graphs, 199
installing, 31
integrating with networkx, 40
limitations, 32
main window and tabs, 32
measuring connectedness, 32, 35
network analysis with, 32, 34-37
Panama Papers case study, 110
passing and saving networks, 29, 31-33, 39
saving visualizations, 32, 38-39
using, 31-40
GetNet, 55

giant connected component (GCC), 125, 128-129, 155
The Good Wife case study, 141-151, 176
Google+, empiric network, 61
Granovetter, Mark, 66
graph exchange XML format, importing and exporting, 30
graph modeling language, importing and exporting, 30
Graph(), 18
graph-tool, 11-13, 16, 211
GraphML
dictionary attributes, 45
importing and exporting, 30, 40
graphs, see also directed graphs; induced graphs; multigraphs; networks; undirected graphs
defined, 2
empty, 18
pseudographs, 18, 53
relabeling nodes, 22
simple network example, 18
types, 18
graphviz
about, 28
cosmetics case study, 158
food and nutrient example, 28
with graph-tool, 14
installing, xvi
node placement in networkx, 13
graphviz-dev, xvi
grid_2d_graph(), 78
grids
defined, 3
synthetic networks, 64, 78
groupby(), 157
groups, see modularity

H 

Hamming distance, 167-169, 180, 187-192
hamming(), 168
harmonic centrality, 93, 96
harmonic_centrality(), 96
height, specifying for trees, 78
HeightsWeights dataset, 170
hierarchical networks
defined, 4
directed acyclic graphs, 203
in social network analysis, 6
HITS (Hyperlink-Induced Topic Search), 35, 95-96
hits(), 96
Holme-Kim graphs, 65, 78
homophily, 57, 98
hubs, 35, 95-96
hypergraphs, 2
Hyperlink-Induced Topic Search (HITS), 35, 95-96
hypernyms, 198
hyponyms, 198

I 

Iago, 119
icons, changing size and color with Gephi, 31
iGraph, 11-12, 15, 137, 210
importing
location for imports, 102
networks, 30-33
node attributes into DataFrame, 74
in_degree attribute, 213
in_degree() method, 48, 199
in_degree_centrality(), 201
incidence matrices, creating networks from, 69, 76
incidence_matrix(), 76
incident edges
defined, 17
directed graphs, 199
removing nodes and, 19
selecting, 23
incident nodes
adding edges with networkx, 103
defined, 17
incidence matrices, 76
indegree
adjacency matrices, 70
centrality, 92, 201
defined, 46
directed graphs, 199
networkx 2.0, 213
reversing, 202
Wikipedia pages case study, 46
indexes
contiguous indexes in graph-tool, 14
Series index, 146
induced edges, 137
induced graphs
bipartite networks, 178
condensation, 202
creating, 137
self-loops, 157
induced nodes, degree, 159
induced_edges(), 137
induced_graph(), 137, 157
influence, 57
information, see knowledge
iPython, 71
is_bipartite(), 177, 187
is_directed_acyclic_graph(), 203
is_isomorphic(), 77
isolates
defined, 125
deleting, 126
locating, 125
problems representing in pure Python, 210
product networks, 122, 125
isolates() method, 126
isomorphism, edge lists, 77
itemgetter, 42, 157
items(), 157
itertools, 153, 157

J 

Justice Resource Institute, 186

K 

k-clique, see cliques
k-core, see core
k-corona, see corona
k-crust, see crust
k-partite networks, see bipartite networks; multi-partite networks
k-shell, see shell
k_clique_communities(), 135
k_core(), 131
k_corona(), 131
k_crust(), 131
k_shell(), 131
Karate Club synthetic network, 65, 79
karate_club_graph(), 79
KeyError, 24
knowledge
information dissemination/diffusion, 57, 62
preservation and social networks, 57, 66
representation, 116
Koblenz Network Collection (KONECT), 61
Kovacs algorithm, 181
Krevl, Andrej, 61
Kunegis, Jerome, 61

L 

labels
Abraham Lincoln timeline, 210
auto-generating community labels, 157-160
editing with Gephi, 31, 33, 37
enabling in Gephi, 33
food and nutrient example, 24, 28
in graph-tool, 15
in graphviz, 28
issues with pure Python, 210
matplotlib and, 28
naming extracted blocks, 139
in networkx, 18
processing attributes selectively, 109
relabeling nodes, 22, 103
layout
cosmetics case study, 158
Gephi, 32, 37-40
graphviz, 28, 158
networkx, 26-28
lemmatization, 144-145
len(), 21, 83, 85
Leskovec, Jure, 61
letter format, 40
linear networks, see paths
links and asymmetric relationships, 198

lists, see also edge lists
cliques, 132
connected components, 127
empty list accumulator, 145
importing and exporting, 30
list of lists, converting to, 72
list of lists, to represent matrices, 70
measuring length, 21
neighborhood measurement, 85
node dictionaries, 76-77
removing non-existent nodes or edges, 19
speed, 25
term lists in cultural domain analysis, 142, 144
term vectors, 145
Live Journal
about, 142
case study, 141-151, 176
empiric network, 61
local bridges, 94
local topology, examples, 57
log(), 106
long tail in power law distribution, 107
Louvain algorithm, 136-137
lower(), 145

M 

machine learning and naming extracted blocks, 139
main core, 130
make_max_clique_graph(), 133
Manhattan distance, 169-171
marketing analysis, 122
Markov chains, 95
Mathematica, xiii
matplotlib
about, 71
food and nutrient example, 25-28
importing, 26
integration with networkx, 25
Panama Papers case study, 101
plotting centralities, 97
version, xv
matrices
attribute mixing, 105-107
bi-adjacency matrix, 180, 187
creating networks from adjacency and incidence matrices, 69-76
dense, 76
event networks, 166
list of lists to represent, 70
list of lists, converting to, 72
matrix-multiplying, 147
similarity matrix, 188
sparse, 76, 147
term matrix in cultural domain analysis, 144
unit, 96
matrix-multiplying, 147
maximal cliques, 132-133, 138

maximum cliques, 132
mean reciprocal distance and closeness centrality, 93
measurement, 83-100, see also degree; distance
centrality, 32, 35, 90, 92-97, 201
clustering coefficients, 32, 35, 87
degree centrality, 35, 92, 96, 201
directed graphs, 199-208
eccentricity, 35, 91, 201
estimating uniformity through assortativity, 97-100
with Gephi, 32, 34-37
global measures, 83
neighborhoods, 84-88
paths, 32, 35, 88-92, 131
similarity, 163-164, 167-173
size, 83
Wikipedia pages case study, 46, 83-100
Measuring Tie Strength in Implicit Social Network, 118
membership and asymmetric relationships, 198
memory
graph-tool, 14
incidence matrices, 76
measuring non-existent edges, 84
merging, duplicate nodes, 33, 108
meshes, see grids
migration distribution example of directed graphs, 199-204

modularity
cosmetics case study, 157
cultural domain analysis case study, 148
defined, 136
measuring, 32, 35
modularity classes, 35
outlining modularity-based communities, 136-138
partitioning networks into communities, 35
trauma types case study, 191
modularity() method, 137
modules
installing, xvi
versions used in this book, xv
MoiKrug, 60
monads, cliques, 132
Moreno, J.L., 5
Mossack Fonseca, 101
most_common(), 158
multi-partite networks
about, 176
anti-communities, 136
examples, 176
MultiDiGraph(), 19
MultiGraph(), 18
multigraphs
adding duplicate nodes or edges, 19
adjacency matrices, 69
creating, 18
defined, 18
directed, 19
social networks, 53
MySpace, 55

N 

names
cosmetics case study, 157-160
extracted blocks, 139
merging duplicate, 108
name generators, 54
term communities, 148-150
NaN (not a number), 75, 109, 165
natural language processing
cultural domain analysis case study, 142-151
distinguishing strong and weak edges in social networks, 66
lemmatizing, 144-145
stemming, 145
stop words, 42, 144-145
toposort example, 204-208
Zipf's law, 147
Natural Language Toolkit
cultural domain analysis case study, 142-151
version, xv


neighborhoods
assortativity, 98
defined, 84
directed graphs, 200
as dyadic, 86
measurement, 84-88
in networkx 2.0, 213
open, 85
path length, 89
transitive closure, 5
neighbors() method, 200, 213
Network Analysis, 64
Network Analysis of Exposure to Trauma and Adverse Events in a Clinical Sample of Children and Adolescents, 185
network dynamics, xv
network flows, xv
Network Science, 185
NetworKit, 11-13, 15, 212
networks, see also assortativity; bipartite networks; classic networks; clustering coefficient; complex networks; eccentricity; ego networks; graphs; measurement; neighborhoods; semantic networks; signed networks; similarity; simple networks; social networks; synthetic networks
as circles, 91
communication, 61
creating from adjacency and incidence matrices, 69-76
creating with Gephi, 31
defined, 2-4
dissortative, 97
editing with Gephi, 31
empiric, 61
event networks, 164-166
flows, 198
hierarchical, 4, 6, 203
importing and exporting, 30-33, 39
saving, 29-30
scale-free, 6
separating cores, shells, coronas, and crusts, 129-131
small-world, 6, 57, 64, 78
splitting into connected components, 126-128
structural elements, 125-139
truncating, 46
Networks, Crowds, and Markets, 185
networkx
about, xiii, 1, 11
Abraham Lincoln timeline example, 212
adding attributes, 23-25
adding or removing nodes and edges, 19, 23, 46, 103
advantages, 15
bipartite networks, 177-178, 180
building food and nutrient example programmatically, 19-30
centrality measurement, 92-97
cliques, 132-135
clustering coefficient measurement, 87
community detection, 12
component analysis, 127
converting Panama Papers CSV file with Pandas, 108-111
converting Panama Papers CSV file without Pandas, 101-107
converting adjacency matrices, 71-75
core-peripheral analysis, 131
cosmetics case study, 153-160
cultural domain analysis case study, 147
density measurement, 84
directed acyclic graphs, 203-208
directed graphs, 199-208
eccentricity measurements, 91
edge lists and node dictionaries, 76-77
estimating uniformity through assortativity, 97-100

generating synthetic networks, 78
global measures, 83
graph creation, 18
graphviz and node placement, 13
importing, 17
importing and exporting networks, 30
integration with Gephi, 40
integration with matplotlib, 25
isolates, 126
layout options, 26-28
modularity-based communities, 136-138
neighborhood measurement, 85
node and edge lists, 20
path measurements, 88-92
reading CSV file, 21
relabeling nodes, 22
resources on, 12, 15, 213
saving and sharing networks, 29-30
saving visualizations, 26
similarity, 164
simple network with, 17-30
size measurement, 83
slicing weighted networks, 79-81
speed, 13, 15
stubs for directed edges visualization, xvi
version 1.11, xv
version 2.0, 208, 213
visualization size limitations, 110
visualization with matplotlib, 25-28
weight assumptions, 71
NetworkX Google discussion group, xvii
NetworkXError, 19
Newman’s definition, 136
next(), 90, 155
nltk
cultural domain analysis case study, 142-151
version, xv

node
defining or changing attributes, 24
degree, 35
storing nodes with, 20
node dictionaries
CSV lookup, 103
moving data with, 76-77
nodes, see also adjacency; attributes; centrality; cliques; communities; degree; incident nodes; isolates; labels; neighborhoods; preferential attachment; snowballing
adding duplicate, 19
adding or removing with Gephi, 31, 33
adding or removing with graph-tool, 14
adding or removing with iGraph, 13
adding or removing with networkx, 19, 23, 46, 103, 110
alter nodes, 54-57, 84-86
assortativity, 97-100
avoiding merging, 158
Barabasi-Albert graphs, 65, 79
bipartite networks, 177-183
in classic networks, 2-4
core, 47
core-peripheral analysis, 130
defined, 2
detecting duplicate nodes, 33
directed graphs, 199
discovering new nodes in ego networks, 55
discreteness, 2
dyadic, 6
editing in Gephi, 33-37
ego nodes, 54-57, 84-87, 89
Erdos-Renyi graphs, 64, 79
gathering for Wikipedia pages case study, 41-44
Holme-Kim graphs, 65, 79

identifying, 7
induced, 159
measuring network size, 83
measuring number of, 21
merging duplicate, 33, 108
naming in cosmetics case study, 157-160
networkx 2.0, 213
path length, 89
product networks, 120-123
random node sampling, 59
reflexive, 2
removing duplicate, 110
removing from ego networks, 55
seed, 7, 41, 59, 109
self-loops, 18
semantic networks, 117, 119
similarity-based networks, 163-174
social networks, 53, 57-60
source nodes, 69, 90
splitting in bipartite networks, 177
storing in networkx, 20
supernodes, 138
synthetic, 133, 137-138
synthetic networks, 61, 63-66, 78
target nodes, 69, 90
transitive closure, 5
triadic, 6
truncating networks, 46
Watts-Strogatz graphs, 64, 79
nodes attribute, 213
nodes() method, 20, 213
NodeView, 213
non-existent edges, 84, 134
non_edges() method, 84
number_of_edges(), 83
number_of_nodes(), 83
NumPy
about, 71
assortativity, 99
converting adjacency matrices, 71
cultural domain analysis case study, 142-151
generating unit matrix, 96
Hamming distance conversion, 169
random layout and, 26
version, xv
nutrient examples, see food and nutrient examples

O

Odnoklassniki, 59, 97-100
Oh No They Didn't! (blog), 142
open neighborhoods, 85
OpenMP, 13, 16
ordering
asymmetric relationships, 198
lookup, 43
in networkx 2.0, 214
selecting attributes by area of interest, 109
Othello, 118
out_degree attribute, 213
out_degree() method, 48, 199
out_degree_centrality(), 201
outdegree
adjacency matrices, 70
centrality, 92, 201
defined, 46
directed graphs, 199
networkx 2.0, 213
reversing, 202
Overview tab in Gephi, 32

P 

p-value, 173, 189 
PageRank, 35, 94-96 
pagerank(), 95-96
painting project example, 122
Pajek, xiii, 30, 148
Panama Papers case study, 101-111, 176
Pandas
about, 71, 73
binarizing attributes, 166
centrality measurements, 96
converting Panama Papers CSV file, 108-111
converting adjacency matrices, 71, 73-75
cosine similarity, 172
cultural domain analysis case study, 142-151
modularity-based communities, 137
networkx 2.0, 214
Pearson correlation, 174, 180
similarity, 164
version, xv
paper and pencil sketching networks, 6-8
parallel edges, 18-19
parallelism
with graph-tool, 13
with NetworKit, 15
Pareto principle, 57, 129, 147


partitioning
clique communities and, 135
cosmetics case study, 157
modularity-based communities, 136-138, 191
networks into communities, 35, 88, 148-150, 157
networks into connected components, 126-128
nodes in bipartite networks, 177
term communities, 148-150
trauma types case study, 191
path length
cliques, 131
components, 131
measuring, 32, 35, 88-92
path_graph(), 78
paths
connected components, 126
cutoff for shortest, 109
directed graphs, 201
geodesics, 90
measuring, 32, 35, 88-92, 131
synthetic networks, 64, 78
as trails, 89
PDF files, saving as, 32, 39
Pearson correlation, 173, 180, 182, 187-192
pearsonr(), 173
pecking order, 203
periphery
core-peripheral analysis, 129
crust as, 130
defined, 91, 129
directed graphs, 201
eccentricity, 91
measuring, 91, 201
periphery() method, 91
pickle
cultural domain analysis case study, 143
importing and exporting networks, 30
plot(), 27
plug-ins and Gephi, 32
PNG files, saving as, 32, 39
power law distribution, 106, 147
powerlaw_cluster_graph(), 78
pre-painting project example, 122
predecessors, 200, 202
predecessors() method, 200
preferential attachment
Barabasi-Albert graphs, 65, 78, 106
defined, 5
giant connected component (GCC), 129
network dynamics, 57
Preview tab in Gephi, 32, 38
product networks
bipartite network example, 179
cliques, 133
cosmetics case study, 153-160
defined, 120
food pantry example, 120
isolates, 125
understanding, 120-123
projected_graph(), 179
projections
bipartite networks, 178-183, 190
event networks, 164
properties, node, see attributes
property maps, 15
pseudographs, 18, 53
pure bridges, 94
pygraphviz, version, xv
pyplot submodule, 26
Pythagoras’ Trousers, 171
Python, version, xv
python-louvain, 136

Q

Qualtrics, 205
quotient_graph(), 214

R

R, iGraph support, 12
radius, 91, 201
radius() method, 91
random layout, 26-28
random node sampling, social networks, 59
random_graph(), 178
random_layout(), 26
Read, Ronald C., 78
read_adjlist(), 30
read_edgelist(), 30
read_gexf(), 30
read_gml(), 30
read_graphml(), 30
read_pajek(), 30
read_pickle(), 30
read_yaml(), 30
reciprocal mean distance and closeness centrality, 93
reflexive nodes, 2
regularity and simple networks, 4
relabel_nodes(), 22
relationships, see edges
remove_edge(), 20
remove_edges_from(), 20, 22
remove_node(), 20
remove_nodes_from(), 20
renderers, Gephi, 39-40
replace(), 108
resources for this book
code files, xvii
networkx, 12, 15, 213
online communities, xvii, 12, 15
reversal, 95, 202, 213
reverse(), 95, 202, 213
rings
as bipartite network, 175
defined, 4
synthetic networks, 64, 78

S

sampling
  random node sampling, 59
  snowballing, 7, 41-44, 59
saving
  networks, 29-30
  unwanted nodes in ego networks, 55
  visualizations in Gephi, 32, 38-39
  visualizations in networkx, 26
scale-free networks, 6
scaling
  Fruchterman-Reingold layout in Gephi, 37
  networkx issues, 27
  scale-free networks, 6

SciPy
  about, 71, 73
  cosine similarity, 172
  Hamming distance, 168
  Manhattan distance, 170
  Pearson correlation, 173, 180
  version, xv
search
  breadth-first search, 43
  iGraph, 13
seed nodes
  Panama Papers case study, 109
  random sampling with, 59
  snowballing with, 7, 41
.select(), 13
selection operator ([]), 23
self-loops
  adjacency matrices, 69
  as cycle, 89
  deleting edges, 45
  identifying, 22
  induced graphs, 137, 157
  merging duplicates and, 33, 45
  networkx 2.0, 214
  removing, 22, 45
  undirected graphs, 18
selfloop(), 214
selfloop_edges(), 22
semantic domain analysis, 116
semantic networks
  asymmetric, 198
  cliques, 133
  defined, 116
  food fraud example, 116-118
  isolates, 125
  Othello example, 118
  understanding, 116-120
sentiment analysis, 66
Sephora cosmetics case study, 153-160
serialization
  directed acyclic graphs, 203
  with pickle, 143
Series
  building, 75
  defined, 73
  importing node attributes, 75
  Index, 146
  joining in DataFrame, 146
  modularity-based communities, 137
  term vector model, 146
Series Index, 146
set_edge_attributes(), 24, 214
set_extent(), 27
set_node_attributes(), 24, 214
sets
  bipartite networks, 177, 213
  frozen sets, 135
  speed and, 25
shared memory multiprocessing, 13
shell
  defined, 130
  deleting all nodes and edges while keeping, 20
  separating, 129-130
shell layout, 26
shell_layout(), 26
shortest_path(), 90
shortest_simple_paths(), 90
signed edges, 199
signed networks
  adjacency matrices, 69
  defined, 60
  vs. directed networks, 199
  weight, 60, 71, 199
similarity, 163-174, see also distance
  bipartite networks, 180-183
  converting similarities to edges, 163
  cosine, 171-173, 187-192
  generalized, 174, 181, 187-192
  local topology and, 57
  matrix, 188
  measuring, 163-164, 167-173
  Pearson correlation, 173, 180, 182, 187-192
  trauma types case study, 185-192
  understanding, 163-167
similarity matrix, 188
similarity_mtx(), 188
similarity_net(), 189
simple networks, see also classic networks
  adjacency matrices, 69
  clustering coefficient, 88
  with networkx, 17-30
  regularity and, 4
single_source_shortest_path_length(), 109
sinks, 95
six degrees of separation, see small-world networks
sketching networks by hand, 6-8
SlashDot, empiric network, 61
slice_projected(), 190
slicing
  bipartite networks, 180, 183, 189
  cultural domain analysis case study, 147
  defined, 80
  with graph-tool, 14
  Hamming distance, 169
  isolates and, 126
  with networkx, 79-81
  product networks, 120
  similarity matrix, 189
  threshold, 80, 126, 147, 180, 183, 189
small-world networks, 6, 57, 64, 78

SNA, see social network analysis
snowballing
  defined, 7
  food and nutrient sketch example, 7
  social networks, 59
  Wikipedia pages case study, 41-44
Social and Economic Networks, 92
social capital, 57
Social Network Analysis, 54, 58
social network analysis, see also social networks
  clustering coefficient, 87
  core-peripheral analysis, 129
  eccentricity, 92
  history, 5
  neighbors, 84-88


social networking websites
  empiric networks, 61
  vs. social networks, 54
social networks
  acquiring, 59-60
  asymmetric, 197
  communication networks, 61
  core-peripheral analysis, 129
  defined, 5, 53, 57
  distinguishing strong and weak edges, 66
  examples, 5
  neighborhoods measurement, 84-88
  Othello semantic network, 119
  path length, 89
  prepared, 61
  properties table, 57
  signed networks, 60
  vs. social networking websites, 54
  synthetic networks, 63-66
  understanding, 53-67
sociocentric networks, see social networks
sociograms, 5
SOCR Data Dinov 020108 HeightsWeights dataset, 170
sorting, tuples, 42
source nodes, 69, 90
Southern women synthetic network, see Davis Southern women synthetic network
sparse matrices, 76, 147
spectral layout, 26-28
spectral_layout(), 26
speed
  calculating generalized similarity in bipartite networks, 182-183
  cliques, 132
  component analysis, 128
  converting adjacency matrices, 71-72
  graph-tool, 13-14, 16
  iGraph, 12, 16
  lists, 25, 132
  NetworKit, 13, 16
  networkx, 13, 15
  problems with pure Python, 210
  snowballing, 43
spring layout, 26-28
spring_layout(), 26
Stack Overflow forums, xvii, 12, 15
Stanford Large Network Dataset Collection, 61
star_graph(), 78
stars
  clustering coefficient, 87
  defined, 4
  preventing with stop words, 42
  synthetic networks, 64, 78
statistics
  calculation tools with graph-tool, 14
  with Gephi, 32, 35
stemming, 145
stop words
  cultural domain analysis case study, 144-145
  preventing stars when snowballing, 42
strongly_connected_component_subgraphs(), 128
strongly_connected_components(), 127

subgraph(), 46, 128, 213
subgraphs
  connected components, 128
  ego networks, 56
  in networkx 2.0, 213
  truncating network, 46
subgraphs, complete, see cliques
subordination, 197, 203
subsets
  clustering, 88
  defined, 6
substitutes, product networks, 120
successors, 200, 202
successors() method, 200
supernodes, synthetic, 138
SurveyMonkey, 205
SVG files, saving as, 32, 39
symmetry, 18
SymPy, 71
synthetic networks
  complex, 78
  generating, 61, 78
  regular, 78
  understanding, 63-66
synthetic nodes
  blockmodeling with synthetic supernodes, 138
  modularity-based communities, 137
  replacing maximal cliques with, 133

T 

target nodes, 69, 90 
technological networks, examples, 5
term communities, portioning and naming, 148-150
term matrices, cultural domain analysis, 144
term vector model (TVM), 146
term vectors, 145
terms
  cultural domain analysis, 142, 144
  extracting and naming term communities, 148-150
  term lists in cultural domain analysis, 142
  term vector model (TVM), 146
  term vectors, 145
ties, see edges
timelines, defined, 3, see also Abraham Lincoln timeline
to_dict_of_lists(), 77
to_directed(), 213
to_edgelist(), 76
to_numpy_matrix(), 72-73
to_pandas_adjacency(), 214
to_pandas_dataframe(), 73, 214
to_pandas_edgelist(), 214
to_undirected(), 202, 213
todense(), 76
tolist(), 72

top nodes, bipartite networks, 178-183
topological_sort(), 203, 206
topology
  directed acyclic graphs, 203-208
  examples, 57
toposort module, xv, 204-208
toposort() method, 207
trails, 88, 201
transitive closure, 5, 88, 203
transitive_closure(), 203
transitivity(), 88
trauma types case study, 185-192
trees
  branching factor, 78
  defined, 3
  stars, 4
  synthetic networks, 64, 78
tri(), 96
triadic census, xv
triadic closure, 57
triads
  cliques, 132
  clustering coefficient measurement, 87
  defined, 6
tripartite networks, examples, 176
truncating networks, 46
tuples, sorting, 42
TVM (term vector model), 146
Twitter, empiric network, 61
two-mode networks, see bipartite networks

U

UCINET, 148
unconnected graphs and snowballing, 7
undirected graphs
  adjacency matrices, 69
  converting directed graphs to, 18, 128, 202
  creating, 18
  defined, 18
  density, 84
  networkx 2.0, 213
  social networks, 53
unit matrices, generating, 96
United States Census Bureau State-to-State Migration Flows dataset, 199
United States Department of Agriculture (USDA), 121, 166
urllib.request module, 154
USDA (United States Department of Agriculture), 121, 166

V 

versions
  charting with networkx, 13
  modules used in this book, xv
  networkx, xv, 208, 213
  Python, xv
views
  with graph-tool, 14
  networkx 2.0, 213
visualizations
  classic networks, 2-4
  directed graphs, 199
  with Gephi, 31-40, 199
  graphviz, 28
  layout options, 26-29, 32, 37-40
  layout phase, 26
  with matplotlib, 25-28
  rendering phase, 26
  saving in Gephi, 32, 38-39
  saving in networkx, 26
  scaling, 27
  size limits of networkx, 110
  sketching by hand, 6-8
  tools, 11-16

W

walks, 88, 201
Watts-Strogatz graphs, 64, 78
weakly_connected_component_subgraphs(), 128
weakly_connected_components(), 127
weight
  adding weighted edges, 24
  adjacency matrices, 69
  assumptions in networkx, 71
  bipartite networks, 179-180, 189
  bridges, 66
  cliques, 66
  converting to dictionary, 147
  defined, 24
  directed networks, 199
  distinguishing strong and weak edges in social networks, 66
  Hamming distance, 169
  incidence matrices, 76
  induced graphs, 137
  negative, 116, 120, 126
  path length measurement, 89
  product networks, 120
  signed networks, 60, 71, 199
  slicing weighted networks, 79-81
  social networks, 53
weighted_projected_graph(), 179
whitespace and unifying duplicate names, 108


wikipedia module
  importing, 42
  version, xv
Wikipedia pages case study
  analysis, 47-48
  centrality measurement, 93-97
  clustering coefficient, 87
  constructing network, 41-48
  density, 84
  measurement, 46, 83-100
  neighborhoods, 85
  path measurements, 88-92
Wilson, Robin J., 78
wind rose example of cosine distance, 171-174
WordNet, 144
wordpunct_tokenize(), 145
write_adjlist(), 30
write_edgelist(), 30
write_gexf(), 30
write_gml(), 30
write_graphml(), 30
write_pajek(), 30
write_pickle(), 30
write_yaml(), 30

Y 

YAML, importing and exporting, 30

Z

Zachary’s Karate Club synthetic network, 65, 79
Zipf’s law, 146-147

More on Python

For data science and basic science, for you and anyone else on your team.


Data Science Essentials in Python

Go from messy, unstructured artifacts stored in SQL
and NoSQL databases to a neat, well-organized dataset
with this quick reference for the busy data scientist.
Understand text mining, machine learning, and network
analysis; process numeric data with the NumPy
and Pandas modules; describe and analyze data using
statistical and network-theoretical methods; and see
actual examples of data analysis at work. This one-stop
solution covers the essential data science you
need in Python.

Dmitry Zinoviev
(224 pages) ISBN: 9781680501841. $29
https://pragprog.com/book/dzpyds



Practical Programming, Third Edition

Classroom-tested by tens of thousands of students,
this new edition of the best-selling intro to programming
book is for anyone who wants to understand
computer science. Learn about design, algorithms,
testing, and debugging. Discover the fundamentals of
programming with Python 3.6, a language that’s used
in millions of devices. Write programs to solve real-world
problems, and come away with everything you
need to produce quality code. This edition has been
updated to use the new language features in Python
3.6.

Paul Gries, Jennifer Campbell, Jason Montojo
(410 pages) ISBN: 9781680502688. $49.95
https://pragprog.com/book/gwpy3










Level Up 

From data structures to architecture and design, we have what you need.


A Common-Sense Guide to Data Structures and Algorithms

If you last saw algorithms in a university course or at
a job interview, you’re missing out on what they can
do for your code. Learn different sorting and searching
techniques, and when to use each. Find out how to
use recursion effectively. Discover structures for specialized
applications, such as trees and graphs. Use
Big O notation to decide which algorithms are best for
your production environment. Beginners will learn how
to use these techniques from the start, and experienced
developers will rediscover approaches they may have
forgotten.

Jay Wengrow
(218 pages) ISBN: 9781680502442. $45.95
https://pragprog.com/book/jwdsal





Design It! 

Don’t engineer by coincidence; design it like you mean
it! Grounded by fundamentals and filled with practical
design methods, this is the perfect introduction to
software architecture for programmers who are ready
to grow their design skills. Ask the right stakeholders
the right questions, explore design options, share your
design decisions, and facilitate collaborative workshops
that are fast, effective, and fun. Become a better programmer,
leader, and designer. Use your new skills to
lead your team in implementing software with the right
capabilities, and develop awesome software!

Michael Keeling
(358 pages) ISBN: 9781680502091. $41.95
https://pragprog.com/book/mkdsa












The Modern Web 

Get up to speed on the latest HTML, CSS, and JavaScript techniques, and secure your Node
applications.


HTML5 and CSS3 (2nd edition)

HTML5 and CSS3 are more than just buzzwords -
they’re the foundation for today’s web applications.
This book gets you up to speed on the HTML5 elements
and CSS3 features you can use right now in your current
projects, with backwards compatible solutions
that ensure that you don’t leave users of older browsers
behind. This new edition covers even more new features,
including CSS animations, IndexedDB, and
client-side validations.

Brian P. Hogan
(314 pages) ISBN: 9781937785598. $38
https://pragprog.com/book/bhh52e






Secure Your Node.js Web Application 


Cyber-criminals have your web applications in their
crosshairs. They search for and exploit common security
mistakes in your web application to steal user data.
Learn how you can secure your Node.js applications,
database and web server to avoid these security holes.
Discover the primary attack vectors against web applications,
and implement security best practices and
effective countermeasures. Coding securely will make
you a stronger web developer and analyst, and you’ll
protect your users.

Karl Düüna
(230 pages) ISBN: 9781680500851. $36
https://pragprog.com/book/kdnodesec












Long Live the Command Line!

Use tmux and Vim for incredible mouse-free productivity.


tmux 2

Your mouse is slowing you down. The time you spend
context switching between your editor and your consoles
eats away at your productivity. Take control of
your environment with tmux, a terminal multiplexer
that you can tailor to your workflow. With this updated
second edition for tmux 2.3, you’ll customize, script,
and leverage tmux’s unique abilities to craft a productive
terminal environment that lets you keep your fingers
on your keyboard’s home row.

Brian P. Hogan
(102 pages) ISBN: 9781680502213. $21.95
https://pragprog.com/book/bhtmux2





Modern Vim 


Turn Vim into a full-blown development environment
using Vim 8’s new features and this sequel to the
beloved bestseller Practical Vim. Integrate your editor
with tools for building, testing, linting, indexing, and
searching your codebase. Discover the future of Vim
with Neovim: a fork of Vim that includes a built-in
terminal emulator that will transform your workflow.
Whether you choose to switch to Neovim or stick with
Vim 8, you’ll be a better developer.

Drew Neil
(190 pages) ISBN: 9781680502626. $39.95
https://pragprog.com/book/modvim









Past and Present

To see where we’re going, remember how we got here, and learn how to take a healthier
approach to programming.


Fire in the Valley

In the 1970s, while their contemporaries were
protesting the computer as a tool of dehumanization
and oppression, a motley collection of college dropouts,
hippies, and electronics fanatics were engaged in
something much more subversive. Obsessed with the
idea of getting computer power into their own hands,
they launched from their garages a hobbyist movement
that grew into an industry, and ultimately a social and
technological revolution. What they did was invent the
personal computer: not just a new device, but a watershed
in the relationship between man and machine.
This is their story.

Michael Swaine and Paul Freiberger
(422 pages) ISBN: 9781937785765. $34
https://pragprog.com/book/fsfire






The Healthy Programmer 

To keep doing what you love, you need to maintain
your own systems, not just the ones you write code
for. Regular exercise and proper nutrition help you
learn, remember, concentrate, and be creative: skills
critical to doing your job well. Learn how to change
your work habits, master exercises that make working
at a computer more comfortable, and develop a plan
to keep fit, healthy, and sharp for years to come.

This book is intended only as an informative guide for
those wishing to know more about health issues. In no
way is this book intended to replace, countermand, or
conflict with the advice given to you by your own
healthcare provider including Physician, Nurse Practitioner,
Physician Assistant, Registered Dietician, and
other licensed professionals.





Joe Kutner 

(254 pages) ISBN: 9781937785314. $36 
https://pragprog.com/book/jkthp











The Pragmatic Bookshelf 

The Pragmatic Bookshelf features books written by developers for developers. The titles
continue the well-known Pragmatic Programmer style and continue to garner awards and
rave reviews. As development gets more and more difficult, the Pragmatic Programmers will
be there with more titles and products to help you stay on top of your game.

Visit Us Online 

This Book's Home Page 

https://pragprog.com/book/dzcnapy

Source code from this book, errata, and other resources. Come give us feedback, too!

Register for Updates 

https://pragprog.com/updates 

Be notified when updates and new books become available.

Join the Community 

https://pragprog.com/community 

Read our weblogs, join our online discussions, participate in our mailing list, interact with
our wiki, and benefit from the experience of other Pragmatic Programmers.

New and Noteworthy 

https://pragprog.com/news 

Check out the latest pragmatic developments, new titles and other offerings.


Buy the Book 

If you liked this eBook, perhaps you’d like to have a paper copy of the book. It’s available
for purchase at our store: https://pragprog.com/book/dzcnapy


Contact Us 


Online Orders:          https://pragprog.com/catalog
Customer Service:       support@pragprog.com
International Rights:   translations@pragprog.com
Academic Use:           academic@pragprog.com
Write for Us:           http://write-for-us.pragprog.com
Or Call:                +1 800-699-7764

